[jira] [Deleted] (OPENNLP-1297) 17.01.2020

2020-01-16 Thread Koji Sekiguchi (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi deleted OPENNLP-1297:



>  17.01.2020 
> -
>
> Key: OPENNLP-1297
> URL: https://issues.apache.org/jira/browse/OPENNLP-1297
> Project: OpenNLP
>  Issue Type: Dependency upgrade
> Environment:  17.01.2020 
>Reporter: Simon poortman
>Priority: Critical
>  Labels: Majesty
>
>  17.01.2020 
> [simon_poort...@icloud.com|mailto:simon_poort...@icloud.com]
> Majesty
> General



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Deleted] (OPENNLP-1295) 15.01.2020 Simon Poortman

2020-01-14 Thread Koji Sekiguchi (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi deleted OPENNLP-1295:



>  15.01.2020  Simon Poortman
> 
>
> Key: OPENNLP-1295
> URL: https://issues.apache.org/jira/browse/OPENNLP-1295
> Project: OpenNLP
>  Issue Type: Dependency
> Environment:  15.01.2020  Simon Poortman
>Reporter: Simon poortman
>Priority: Major
>  Labels: Majesty, king, secret-elite
>
>  15.01.2020  Simon Poortman
>  
> Majesty
> King
> General





[jira] [Issue Comment Deleted] (OPENNLP-852) CountryContextFile should support multiple regexes

2019-03-31 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-852:
---
Comment: was deleted

(was: A comment with security level 'Administrators' was removed.)

> CountryContextFile should support multiple regexes
> --
>
> Key: OPENNLP-852
> URL: https://issues.apache.org/jira/browse/OPENNLP-852
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Entity Linker
>Affects Versions: addons-1.6.0
> Environment: windows 7, any
>Reporter: Mark Giaconia
>Assignee: Mark Giaconia
>Priority: Major
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> This will require reindexing all data, and constructing a new file format. 
> This will be a big improvement in terms of recall of general location context.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Deleted] (OPENNLP-1251) Simon Poortman19-88

2019-03-29 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi deleted OPENNLP-1251:



> Simon Poortman19-88
> -
>
> Key: OPENNLP-1251
> URL: https://issues.apache.org/jira/browse/OPENNLP-1251
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Simon Poortman
>Priority: Critical
>  Labels: Criminal
>
> Simon is de Beste





[jira] [Resolved] (OPENNLP-1224) Use Daemon threads in executor services

2018-11-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1224.
-
   Resolution: Fixed
Fix Version/s: 1.9.1

> Use Daemon threads in executor services
> ---
>
> Key: OPENNLP-1224
> URL: https://issues.apache.org/jira/browse/OPENNLP-1224
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Edd Spencer
>Assignee: Koji Sekiguchi
>Priority: Major
> Fix For: 1.9.1
>
>
> For all executor services it would be ideal if they were configured to use 
> daemon threads. This means that, should the process need to be shut down, it 
> will not wait for these threads to complete (which can take a long time, 
> depending on the operation).
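
A minimal sketch of the idea described above, assuming a plain java.util.concurrent executor. DaemonExecutorSketch and its methods are hypothetical names used for illustration only; this is not the change that was committed for this ticket.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical helper, not part of OpenNLP: builds executors whose worker
// threads are daemon threads, so they do not keep the JVM alive on shutdown.
class DaemonExecutorSketch {

  // Wraps the default thread factory and marks every new thread as a daemon.
  static ThreadFactory daemonThreadFactory() {
    ThreadFactory defaults = Executors.defaultThreadFactory();
    return runnable -> {
      Thread thread = defaults.newThread(runnable);
      thread.setDaemon(true);
      return thread;
    };
  }

  // Fixed-size pool that uses the daemon thread factory above.
  static ExecutorService newDaemonFixedThreadPool(int threads) {
    return Executors.newFixedThreadPool(threads, daemonThreadFactory());
  }

  public static void main(String[] args) throws Exception {
    ExecutorService executor = newDaemonFixedThreadPool(2);
    Callable<Boolean> task = () -> Thread.currentThread().isDaemon();
    System.out.println("worker is daemon: " + executor.submit(task).get());
    executor.shutdown();
  }
}
```

The trade-off is that daemon threads are abandoned when the JVM exits, so work that must finish still needs an explicit shutdown() and awaitTermination().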





[jira] [Assigned] (OPENNLP-1224) Use Daemon threads in executor services

2018-11-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1224:
---

Assignee: Koji Sekiguchi

> Use Daemon threads in executor services
> ---
>
> Key: OPENNLP-1224
> URL: https://issues.apache.org/jira/browse/OPENNLP-1224
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Edd Spencer
>Assignee: Koji Sekiguchi
>Priority: Major
>
> For all executor services it would be ideal if they were configured to use 
> daemon threads. This means that, should the process need to be shut down, it 
> will not wait for these threads to complete (which can take a long time, 
> depending on the operation).





[jira] [Reopened] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner

2018-10-15 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reopened OPENNLP-1214:
-

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check whether each character in the sentence is one of the eos characters. I 
> think we'd better use a HashSet to keep eosCharacters instead of char[].
> In accordance with this replacement, I'd like to deprecate 
> getEndOfSentenceCharacters() because it returns char[] and nobody in OpenNLP 
> calls it at present, and I'd like to add an equivalent method which returns 
> a Set of eos chars. It cannot keep the order of the eos chars, but I don't 
> think that will be a problem.





[jira] [Resolved] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner

2018-10-02 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1214.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check whether each character in the sentence is one of the eos characters. I 
> think we'd better use a HashSet to keep eosCharacters instead of char[].
> In accordance with this replacement, I'd like to deprecate 
> getEndOfSentenceCharacters() because it returns char[] and nobody in OpenNLP 
> calls it at present, and I'd like to add an equivalent method which returns 
> a Set of eos chars. It cannot keep the order of the eos chars, but I don't 
> think that will be a problem.





[jira] [Created] (OPENNLP-1221) FeatureGeneratorUtil.tokenFeature() is too specific for some languages

2018-09-25 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1221:
---

 Summary: FeatureGeneratorUtil.tokenFeature() is too specific for 
some languages
 Key: OPENNLP-1221
 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
 Project: OpenNLP
  Issue Type: Improvement
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi


As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, 
HIRA (あ, い, う, え, お etc.), KATA (ア, イ, ウ, エ, オ etc.), ALPHA and OTHER 
as token classes. What FeatureGeneratorUtil.tokenFeature() provides at present 
is too specific. For example, I don't need to distinguish among lc (lowercase 
alphabet), ac (all capital letters) and ic (initial capital letter).

As a trial, I applied the following patch to avoid generating overly specific 
token classes:

{code}
diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
index e6b8af95..405938d1 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
@@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
   private static final String TOKEN_AND_CLASS_PREFIX = "w";
 
   private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
+  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
+  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
 
   /**
* Generates a class name for the specified token.
@@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
 else if (pattern.isAllKatakana()) {
   feat = "jak";
 }
-else if (pattern.isAllLowerCaseLetter()) {
-  feat = "lc";
+else if (pDigit.matcher(token).find()) {
+  feat = "digit";
 }
-else if (pattern.digits() == 2) {
-  feat = "2d";
-}
-else if (pattern.digits() == 4) {
-  feat = "4d";
-}
-else if (pattern.containsDigit()) {
-  if (pattern.containsLetters()) {
-feat = "an";
-  }
-  else if (pattern.containsHyphen()) {
-feat = "dd";
-  }
-  else if (pattern.containsSlash()) {
-feat = "ds";
-  }
-  else if (pattern.containsComma()) {
-feat = "dc";
-  }
-  else if (pattern.containsPeriod()) {
-feat = "dp";
-  }
-  else {
-feat = "num";
-  }
-}
-else if (pattern.isAllCapitalLetter()) {
-  if (token.length() == 1) {
-feat = "sc";
-  }
-  else {
-feat = "ac";
-  }
-}
-else if (capPeriod.matcher(token).find()) {
-  feat = "cp";
-}
-else if (pattern.isInitialCapitalLetter()) {
-  feat = "ic";
+else if (pAlpha.matcher(token).find()) {
+  feat = "alpha";
 }
 else {
   feat = "other";
{code}

With this patch, total F1 increased from 82.00% to 82.13%. The gain may be 
trivial, but I think there is still a lot of room to tune and improve the 
performance.

Fortunately, I was able to add the japanese-addon project to opennlp-addons in 
the previous ticket, so I'd like to add some programs that generate simpler 
token classes to japanese-addon.
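
A minimal, standalone sketch of what such a simpler token class generator might look like. The class name, method name and the labels "hira" and "kata" are assumptions made for illustration; only "digit", "alpha" and "other" appear in the patch above, and this is not the code that went into japanese-addon.

```java
import java.util.regex.Pattern;

// Illustrative only: class, method and label names are assumptions and are
// not taken from opennlp-tools or opennlp-addons.
class SimpleTokenClassSketch {

  private static final Pattern HIRA = Pattern.compile("\\p{IsHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{IsKatakana}+");
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");

  // Kana are alphabetic too, so the kana checks must come before ALPHA.
  static String tokenClass(String token) {
    if (HIRA.matcher(token).matches()) {
      return "hira";
    } else if (KATA.matcher(token).matches()) {
      return "kata";
    } else if (DIGIT.matcher(token).matches()) {
      return "digit";
    } else if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    return "other";
  }

  public static void main(String[] args) {
    // \u3042\u3044 is hiragana "a i", \u30A2\u30A4 is katakana "a i".
    String[] tokens = {"\u3042\u3044", "\u30A2\u30A4", "2018", "OpenNLP", "a-1"};
    for (String token : tokens) {
      System.out.println(token + " -> " + tokenClass(token));
    }
  }
}
```

Because \p{IsAlphabetic} also matches kana and kanji, the order of the checks decides which class a token receives.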





[jira] [Resolved] (OPENNLP-1201) add bailout way for certain languages in order to use POS features

2018-09-25 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1201.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

This feature has been added to opennlp-addons. Thanks!

> add bailout way for certain languages in order to use POS features
> --
>
> Key: OPENNLP-1201
> URL: https://issues.apache.org/jira/browse/OPENNLP-1201
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Command Line Interface, Formats
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Major
>
> As OpenNLP tools depend on the text being tokenized in advance (in other 
> words, the words in the text are separated from each other by spaces), it is 
> difficult for users of certain languages (e.g. CJK) to use POS 
> (Part-of-Speech) features.
> To simplify the explanation, consider using NameFinder for Japanese text. The 
> NameFinder tools (Train, Eval, Recognize) require users to provide Japanese 
> text which has already been tokenized, but once we tokenize Japanese text, 
> it loses its POS information. (I think Chinese has the same problem.)
> Let me describe this problem for users of western languages :) (English, 
> French, Italian, etc.) without using Japanese letters. I’ll use the English 
> alphabet instead.
> Suppose you have a sentence “isentthemachine” which you want to give to 
> NameFinder, and you use a morphological analyzer to tokenize it. There are 
> two possible token sequences:
> - i (PPSS) / sent (VBD) / the (AT) / machine (NP)
> - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)
> As you can see, a morphological analyzer not only tokenizes the sentence but 
> also assigns a POS tag to each token. The same thing happens in Japanese 
> (and Chinese, I think).
> However, because the OpenNLP feature generator API accepts only a sequence of 
> tokens, i.e. `String[] tokens`, I cannot produce POS features in the feature 
> generator.
> To solve this problem (and to invite many users to our community), I’d like 
> to suggest that OpenNLP tools allow users to add optional information to each 
> tokenized word.
> For example, one can give the following text when using NameFinder tools.
> {code}
> $ cat en-ner.train
> I/PPSS sent/VBD the/AT machine/NP
> {code}
> When using such text, users must tell the tool that each token in the text 
> carries a POS tag by using a certain option, e.g. -postag
> {code}
> $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
> {code}
> We can maintain backward compatibility by setting -postag to false by 
> default; in this case, existing feature generators work exactly the same as 
> before. If a user sets the -postag option on the command line, the existing 
> feature generators strip the “/POS” part of each “word/POS” token in the text 
> so that they produce the same features as before.
> I’d like to add a simple feature generator which generates only the “POS” 
> part of each “word/POS” token, in addition to managing the -postag option. 
> This simple feature generator allows Japanese/Chinese users to produce precise 
> POS features.
> I’d like to focus on NameFinder in this ticket (let me add this option to 
> other tools (chunker, classifier, etc.) in another ticket, if needed).
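
A minimal sketch of the "word/POS" handling proposed above, assuming tokens are separated by whitespace and the tag follows the last slash. PosTaggedTokenSketch is a hypothetical helper for illustration; it is not the implementation added to opennlp-addons.

```java
import java.util.Arrays;

// Illustrative only: a hypothetical helper, not the opennlp-addons code.
class PosTaggedTokenSketch {

  // Splits "word/POS" at the last slash; a token without a slash gets an empty tag.
  static String[] splitWordAndTag(String token) {
    int slash = token.lastIndexOf('/');
    if (slash < 0) {
      return new String[] {token, ""};
    }
    return new String[] {token.substring(0, slash), token.substring(slash + 1)};
  }

  // Extracts one part of every token: part 0 is the word, part 1 is the tag.
  private static String[] column(String line, int part) {
    String[] tokens = line.trim().split("\\s+");
    String[] result = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      result[i] = splitWordAndTag(tokens[i])[part];
    }
    return result;
  }

  // What existing feature generators would see: the plain words.
  static String[] words(String line) {
    return column(line, 0);
  }

  // What a POS-only feature generator would see: the tags.
  static String[] posTags(String line) {
    return column(line, 1);
  }

  public static void main(String[] args) {
    String line = "I/PPSS sent/VBD the/AT machine/NP";
    System.out.println(Arrays.toString(words(line)));
    System.out.println(Arrays.toString(posTags(line)));
  }
}
```

Splitting at the last slash rather than the first keeps words that themselves contain a slash intact.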





[jira] [Resolved] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator

2018-09-19 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1219.
-
   Resolution: Fixed
Fix Version/s: 1.9.1

> change private instance variable featureGenerators to protected in 
> DefaultNameContextGenerator
> --
>
> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via 
> the -factory option. As I wanted to override 
> DefaultNameContextGenerator.getContext(), I made a subclass of 
> TokenNameFinderFactory and, in its constructor, created an instance of a 
> subclass of DefaultNameContextGenerator. However, I couldn't implement the 
> getContext() method of my DefaultNameContextGenerator subclass because I 
> couldn't access the private member featureGenerators.





[jira] [Assigned] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator

2018-09-19 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1219:
---

Assignee: Koji Sekiguchi

> change private instance variable featureGenerators to protected in 
> DefaultNameContextGenerator
> --
>
> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via 
> the -factory option. As I wanted to override 
> DefaultNameContextGenerator.getContext(), I made a subclass of 
> TokenNameFinderFactory and, in its constructor, created an instance of a 
> subclass of DefaultNameContextGenerator. However, I couldn't implement the 
> getContext() method of my DefaultNameContextGenerator subclass because I 
> couldn't access the private member featureGenerators.





[jira] [Updated] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator

2018-09-19 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1219:

Summary: change private instance variable featureGenerators to protected in 
DefaultNameContextGenerator  (was: change instance variable featureGenerators 
to protected in DefaultNameContextGenerator)

> change private instance variable featureGenerators to protected in 
> DefaultNameContextGenerator
> --
>
> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Priority: Minor
>
> TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via 
> the -factory option. As I wanted to override 
> DefaultNameContextGenerator.getContext(), I made a subclass of 
> TokenNameFinderFactory and, in its constructor, created an instance of a 
> subclass of DefaultNameContextGenerator. However, I couldn't implement the 
> getContext() method of my DefaultNameContextGenerator subclass because I 
> couldn't access the private member featureGenerators.





[jira] [Created] (OPENNLP-1219) change instance variable featureGenerators to protected in DefaultNameContextGenerator

2018-09-19 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1219:
---

 Summary: change instance variable featureGenerators to protected 
in DefaultNameContextGenerator
 Key: OPENNLP-1219
 URL: https://issues.apache.org/jira/browse/OPENNLP-1219
 Project: OpenNLP
  Issue Type: Improvement
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi


TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via 
the -factory option. As I wanted to override 
DefaultNameContextGenerator.getContext(), I made a subclass of 
TokenNameFinderFactory and, in its constructor, created an instance of a 
subclass of DefaultNameContextGenerator. However, I couldn't implement the 
getContext() method of my DefaultNameContextGenerator subclass because I 
couldn't access the private member featureGenerators.
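
The access problem can be illustrated with stand-in classes. Everything in this sketch is hypothetical: BaseContextGeneratorSketch is not the real DefaultNameContextGenerator, and the "generators" are reduced to plain string prefixes so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Stand-in for the base class; this is NOT the real DefaultNameContextGenerator.
class BaseContextGeneratorSketch {

  // If this field were private, the subclass below would not compile.
  protected final List<String> featureGenerators;

  BaseContextGeneratorSketch(String... featureGenerators) {
    this.featureGenerators = Arrays.asList(featureGenerators);
  }

  // Default behaviour: one feature per generator, built from the raw token.
  List<String> getContext(String token) {
    List<String> features = new ArrayList<>();
    for (String generator : featureGenerators) {
      features.add(generator + "=" + token);
    }
    return features;
  }
}

// A subclass that overrides getContext() and reuses the inherited generators.
class LowercasingContextGeneratorSketch extends BaseContextGeneratorSketch {

  LowercasingContextGeneratorSketch(String... featureGenerators) {
    super(featureGenerators);
  }

  @Override
  List<String> getContext(String token) {
    List<String> features = new ArrayList<>();
    for (String generator : featureGenerators) { // needs protected access
      features.add(generator + "=" + token.toLowerCase(Locale.ROOT));
    }
    return features;
  }

  public static void main(String[] args) {
    System.out.println(new LowercasingContextGeneratorSketch("w", "wc").getContext("Tokyo"));
  }
}
```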







[jira] [Resolved] (OPENNLP-1216) opennlp command should allow users to set heap size

2018-08-29 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1216.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

> opennlp command should allow users to set heap size
> ---
>
> Key: OPENNLP-1216
> URL: https://issues.apache.org/jira/browse/OPENNLP-1216
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Command Line Interface
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> When I used ParserTrainer, I got an OutOfMemoryError. I checked the opennlp 
> shell script and found that users cannot change the heap size without editing 
> the script.
> I think we should allow users to set it like this:
> {code}
> $ JAVA_HEAP=4096m opennlp ParserTrainer ...
> {code}





[jira] [Resolved] (OPENNLP-1217) opennlp Parser can take only one model file

2018-08-29 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1217.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

> opennlp Parser can take only one model file
> ---
>
> Key: OPENNLP-1217
> URL: https://issues.apache.org/jira/browse/OPENNLP-1217
> Project: OpenNLP
>  Issue Type: Documentation
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.9.1
>
>






[jira] [Resolved] (OPENNLP-1215) ParserTrainer's option -head-rules in the document should be -headRules

2018-08-29 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1215.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

> ParserTrainer's option -head-rules in the document should be -headRules
> ---
>
> Key: OPENNLP-1215
> URL: https://issues.apache.org/jira/browse/OPENNLP-1215
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.9.1
>
>
> The documentation has a section that describes this option as -head-rules, so 
> I tried to execute:
> {code}
> opennlp ParserTrainer -model en-parser.bin -data en-parser.train -head-rules 
> opennlp-tools/lang/en/parser/en-head_rules -lang en
> {code}
> and I got the error `Missing mandatory parameter: -headRules`





[jira] [Created] (OPENNLP-1217) opennlp Parser can take only one model file

2018-08-29 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1217:
---

 Summary: opennlp Parser can take only one model file
 Key: OPENNLP-1217
 URL: https://issues.apache.org/jira/browse/OPENNLP-1217
 Project: OpenNLP
  Issue Type: Documentation
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1








[jira] [Created] (OPENNLP-1216) opennlp command should allow users to set heap size

2018-08-28 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1216:
---

 Summary: opennlp command should allow users to set heap size
 Key: OPENNLP-1216
 URL: https://issues.apache.org/jira/browse/OPENNLP-1216
 Project: OpenNLP
  Issue Type: Documentation
  Components: Command Line Interface
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


When I used ParserTrainer, I got an OutOfMemoryError. I checked the opennlp 
shell script and found that users cannot change the heap size without editing 
the script.

I think we should allow users to set it like this:

{code}
$ JAVA_HEAP=4096m opennlp ParserTrainer ...
{code}






[jira] [Created] (OPENNLP-1215) ParserTrainer's option -head-rules in the document should be -headRules

2018-08-28 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1215:
---

 Summary: ParserTrainer's option -head-rules in the document should 
be -headRules
 Key: OPENNLP-1215
 URL: https://issues.apache.org/jira/browse/OPENNLP-1215
 Project: OpenNLP
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


The documentation has a section that describes this option as -head-rules, so 
I tried to execute:

{code}
opennlp ParserTrainer -model en-parser.bin -data en-parser.train -head-rules 
opennlp-tools/lang/en/parser/en-head_rules -lang en
{code}

and I got the error `Missing mandatory parameter: -headRules`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (OPENNLP-1213) Use ja for Japanese language code rather than jp

2018-08-24 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1213.
-
Resolution: Fixed

> Use ja for Japanese language code rather than jp
> 
>
> Key: OPENNLP-1213
> URL: https://issues.apache.org/jira/browse/OPENNLP-1213
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> It seems that the Factory of sentdetect uses "jp" as the Japanese language 
> code, but I think that is the country code. Let's use "ja" instead.
> We could keep "jp" for back-compat, but I don't think we need to. So I'll 
> just replace "jp" with "ja" in the patch.
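
The distinction between the two codes can be checked against the JDK's own ISO tables. This is a small illustrative sketch using java.util.Locale only; JapaneseCodesSketch is a hypothetical class name and has nothing to do with the sentdetect Factory itself.

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative only: checks codes against the ISO lists shipped with the JDK.
class JapaneseCodesSketch {

  // True if the code is a two-letter ISO 639 language code known to the JDK.
  static boolean isIsoLanguage(String code) {
    return Arrays.asList(Locale.getISOLanguages()).contains(code);
  }

  // True if the code is a two-letter ISO 3166 country code known to the JDK.
  static boolean isIsoCountry(String code) {
    return Arrays.asList(Locale.getISOCountries()).contains(code);
  }

  public static void main(String[] args) {
    System.out.println("language of Locale.JAPANESE: " + Locale.JAPANESE.getLanguage());
    System.out.println("country of Locale.JAPAN: " + Locale.JAPAN.getCountry());
    System.out.println("\"ja\" is a language code: " + isIsoLanguage("ja"));
    System.out.println("\"jp\" is a language code: " + isIsoLanguage("jp"));
    System.out.println("\"JP\" is a country code: " + isIsoCountry("JP"));
  }
}
```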





[jira] [Assigned] (OPENNLP-1213) Use ja for Japanese language code rather than jp

2018-08-24 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1213:
---

Assignee: Koji Sekiguchi

> Use ja for Japanese language code rather than jp
> 
>
> Key: OPENNLP-1213
> URL: https://issues.apache.org/jira/browse/OPENNLP-1213
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> It seems that the Factory of sentdetect uses "jp" as the Japanese language 
> code, but I think that is the country code. Let's use "ja" instead.
> We could keep "jp" for back-compat, but I don't think we need to. So I'll 
> just replace "jp" with "ja" in the patch.





[jira] [Resolved] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag

2018-08-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1212.
-
Resolution: Fixed

> TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
> ---
>
> Key: OPENNLP-1212
> URL: https://issues.apache.org/jira/browse/OPENNLP-1212
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> TokenFeatureGenerator can accept a lowercase flag, but because 
> TokenFeatureGeneratorFactory doesn't allow us to set it, 
> TokenFeatureGenerator always returns lowercase tokens.





[jira] [Resolved] (OPENNLP-1211) Improve WindowFeatureGeneratorTest

2018-08-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1211.
-
Resolution: Fixed

> Improve WindowFeatureGeneratorTest
> --
>
> Key: OPENNLP-1211
> URL: https://issues.apache.org/jira/browse/OPENNLP-1211
> Project: OpenNLP
>  Issue Type: Test
>  Components: Build, Packaging and Test
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.9.1
>
>
> I'd like to improve WindowFeatureGeneratorTest in the following respects:
> * testWindowSizeOne should check the contents of the returned features. At 
> present it only checks the length of the features.
> * Most of the test methods pass the arguments of Assert.assertEquals(expected, 
> actual) in the opposite order when checking the contents of the returned 
> features:
> {code}
> Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
> {code}
> should be
> {code}
> Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
> {code}
> * Although I pointed out the argument order of assertEquals() above, I think 
> we'd better use an exact concrete string rather than an expression such as 
> testSentence[testTokenIndex] for the expected value. Also, 
> testForCorrectFeatures uses the contains method when checking the contents of 
> the returned features, but I think we should avoid contains when checking the 
> items in a List. It is currently written like this:
> {code}
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + 
> "2" +
> testSentence[testTokenIndex - 2]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + 
> "1" +
> testSentence[testTokenIndex - 1]));
> Assert.assertTrue(features.contains(testSentence[testTokenIndex]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + 
> "1" +
> testSentence[testTokenIndex + 1]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + 
> "2" +
> testSentence[testTokenIndex + 2]));
> {code}
> but I'd like to rewrite them like this:
> {code}
> Assert.assertEquals("d",features.get(0));
> Assert.assertEquals("p1c",features.get(1));
> Assert.assertEquals("p2b",features.get(2));
> Assert.assertEquals("n1e",features.get(3));
> Assert.assertEquals("n2f",features.get(4));
> {code}
> The second form helps us to understand how WindowFeatureGenerator works and 
> it's easier to read.
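
To illustrate why contains-based checks are weaker than positional ones, here is a small sketch that uses only java.util and no test framework. The class and method names are hypothetical, and the feature strings are the ones quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: compares a contains-style check with an ordered check.
class OrderedAssertionSketch {

  // Passes as long as every expected item occurs somewhere in the list.
  static boolean containsAll(List<String> features, String... expected) {
    return features.containsAll(Arrays.asList(expected));
  }

  // Passes only if the list holds exactly the expected items in this order.
  static boolean equalsInOrder(List<String> features, String... expected) {
    return features.equals(Arrays.asList(expected));
  }

  public static void main(String[] args) {
    // Same items as expected, but in a different order and with one extra.
    List<String> features = Arrays.asList("n2f", "d", "p1c", "p2b", "n1e", "extra");
    String[] expected = {"d", "p1c", "p2b", "n1e", "n2f"};
    System.out.println("contains-style check passes: " + containsAll(features, expected));
    System.out.println("ordered check passes: " + equalsInOrder(features, expected));
  }
}
```

The contains-style check accepts reordered lists and unexpected extra features, which an index-by-index comparison would reject.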





[jira] [Assigned] (OPENNLP-1211) Improve WindowFeatureGeneratorTest

2018-08-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1211:
---

Assignee: Koji Sekiguchi

> Improve WindowFeatureGeneratorTest
> --
>
> Key: OPENNLP-1211
> URL: https://issues.apache.org/jira/browse/OPENNLP-1211
> Project: OpenNLP
>  Issue Type: Test
>  Components: Build, Packaging and Test
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.9.1
>
>
> I'd like to improve WindowFeatureGeneratorTest in the following respects:
> * testWindowSizeOne should check the contents of the returned features. At 
> present it only checks the length of the features.
> * Most of the test methods pass the arguments of Assert.assertEquals(expected, 
> actual) in the opposite order when checking the contents of the returned 
> features:
> {code}
> Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
> {code}
> should be
> {code}
> Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
> {code}
> * Although I pointed out the argument order of assertEquals() above, I think 
> we'd better use an exact concrete string rather than an expression such as 
> testSentence[testTokenIndex] for the expected value. Also, 
> testForCorrectFeatures uses the contains method when checking the contents of 
> the returned features, but I think we should avoid contains when checking the 
> items in a List. It is currently written like this:
> {code}
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + 
> "2" +
> testSentence[testTokenIndex - 2]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + 
> "1" +
> testSentence[testTokenIndex - 1]));
> Assert.assertTrue(features.contains(testSentence[testTokenIndex]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + 
> "1" +
> testSentence[testTokenIndex + 1]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + 
> "2" +
> testSentence[testTokenIndex + 2]));
> {code}
> but I'd like to rewrite them like this:
> {code}
> Assert.assertEquals("d",features.get(0));
> Assert.assertEquals("p1c",features.get(1));
> Assert.assertEquals("p2b",features.get(2));
> Assert.assertEquals("n1e",features.get(3));
> Assert.assertEquals("n2f",features.get(4));
> {code}
> The second form helps us to understand how WindowFeatureGenerator works and 
> it's easier to read.





[jira] [Assigned] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag

2018-08-13 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1212:
---

Assignee: Koji Sekiguchi

> TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
> ---
>
> Key: OPENNLP-1212
> URL: https://issues.apache.org/jira/browse/OPENNLP-1212
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> TokenFeatureGenerator accepts a lowercase flag, but 
> TokenFeatureGeneratorFactory doesn't allow us to set it, so 
> TokenFeatureGenerator always returns lowercase tokens.





[jira] [Created] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner

2018-08-13 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1214:
---

 Summary: use hash to avoid linear search in 
DefaultEndOfSentenceScanner
 Key: OPENNLP-1214
 URL: https://issues.apache.org/jira/browse/OPENNLP-1214
 Project: OpenNLP
  Issue Type: Improvement
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


When DefaultEndOfSentenceScanner scans a sentence, it uses a linear search to 
check whether each character in the sentence is one of the eos characters. I 
think we'd better use a HashSet to hold eosCharacters instead of a char[].

Along with this replacement, I'd like to deprecate 
getEndOfSentenceCharacters() because it returns char[] and nothing in OpenNLP 
calls it at present, and I'd like to add an equivalent method which returns a 
Set of eos chars. A Set cannot keep the order of the eos chars, but I don't 
think that will be a problem.
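
The difference between the two lookups can be sketched as follows. This is not the DefaultEndOfSentenceScanner code; EosLookupSketch, its method names and the three eos characters are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EosLookupSketch {

  // Linear search: each character of the text is compared against every eos char.
  public static List<Integer> positionsLinear(CharSequence text, char[] eosChars) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      for (char eos : eosChars) {
        if (text.charAt(i) == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  // Hash lookup: one contains() call per character of the text.
  public static List<Integer> positionsHashed(CharSequence text, Set<Character> eosChars) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      if (eosChars.contains(text.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    char[] asArray = {'.', '!', '?'}; // assumed eos characters
    Set<Character> asSet = new HashSet<>(Arrays.asList('.', '!', '?'));
    String text = "Hi. Ok? Yes!";
    System.out.println(positionsLinear(text, asArray)); // [2, 6, 11]
    System.out.println(positionsHashed(text, asSet));   // [2, 6, 11]
  }
}
```

With only a handful of eos characters the gain may be small, and the Set version boxes each char, so it would be worth measuring before and after.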





[jira] [Created] (OPENNLP-1213) Use ja for Japanese language code rather than jp

2018-08-13 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1213:
---

 Summary: Use ja for Japanese language code rather than jp
 Key: OPENNLP-1213
 URL: https://issues.apache.org/jira/browse/OPENNLP-1213
 Project: OpenNLP
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


It seems that the Factory of sentdetect uses "jp" as the Japanese language 
code, but I think that is the country code. Let's use "ja" instead.

We could leave "jp" for back-compat, but I don't think we need to do it. So 
I'll just replace "jp" with "ja" in the patch.
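
For reference, the JDK's own predefined locale shows the same distinction between the language code and the country code. This small sketch only uses java.util.Locale; the class name JapaneseCodeSketch is made up for illustration.

```java
import java.util.Locale;

public class JapaneseCodeSketch {

  // Language code of the JDK's predefined locale for Japan.
  public static String languageCode() {
    return Locale.JAPAN.getLanguage();
  }

  // Country code of the same locale.
  public static String countryCode() {
    return Locale.JAPAN.getCountry();
  }

  public static void main(String[] args) {
    System.out.println(languageCode()); // ja
    System.out.println(countryCode());  // JP
  }
}
```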





[jira] [Updated] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory

2018-08-11 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1206:

Fix Version/s: 1.9.1

> add TrigramNameFeatureGeneratorFactory
> --
>
> Key: OPENNLP-1206
> URL: https://issues.apache.org/jira/browse/OPENNLP-1206
> Project: OpenNLP
>  Issue Type: Task
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> Surprisingly, it's missing. :) I noticed it when I tried to use it in my 
> feature generator XML.





[jira] [Created] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag

2018-08-10 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1212:
---

 Summary: TokenFeatureGeneratorFactory doesn't allow us to set 
lowercase flag
 Key: OPENNLP-1212
 URL: https://issues.apache.org/jira/browse/OPENNLP-1212
 Project: OpenNLP
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


TokenFeatureGenerator accepts a lowercase flag, but 
TokenFeatureGeneratorFactory doesn't allow us to set it, so 
TokenFeatureGenerator always returns lowercase tokens.
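
One possible shape of the fix, as a rough sketch only: the factory reads an optional attribute and passes it on, defaulting to the current behavior. Nothing below is OpenNLP code. LowercaseFlagSketch, TokenFeatures, featuresFor, the "lowercase" attribute name and the "w=" feature prefix are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LowercaseFlagSketch {

  // Hypothetical stand-in for a token feature generator with a lowercase switch.
  static final class TokenFeatures {
    private final boolean lowercase;

    TokenFeatures(boolean lowercase) {
      this.lowercase = lowercase;
    }

    List<String> create(String[] tokens, int index) {
      String token = lowercase ? tokens[index].toLowerCase(Locale.ROOT) : tokens[index];
      List<String> features = new ArrayList<>();
      features.add("w=" + token); // the "w=" prefix is an assumption of this sketch
      return features;
    }
  }

  // Hypothetical factory logic: read an optional "lowercase" attribute and
  // default to true, so configurations without the attribute behave as before.
  public static List<String> featuresFor(Map<String, String> attributes,
                                         String[] tokens, int index) {
    String value = attributes.get("lowercase");
    boolean lowercase = value == null || Boolean.parseBoolean(value);
    return new TokenFeatures(lowercase).create(tokens, index);
  }

  public static void main(String[] args) {
    String[] tokens = {"Tokyo"};
    Map<String, String> explicit = new HashMap<>();
    explicit.put("lowercase", "false");
    System.out.println(featuresFor(Collections.emptyMap(), tokens, 0)); // [w=tokyo]
    System.out.println(featuresFor(explicit, tokens, 0));               // [w=Tokyo]
  }
}
```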





[jira] [Updated] (OPENNLP-1211) Improve WindowFeatureGeneratorTest

2018-08-10 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1211:

Description: 
I'd like to improve WindowFeatureGeneratorTest in the following respects:

* testWindowSizeOne should check the contents of the returned features. It 
currently checks only their length
* most of the test methods pass the arguments of Assert.assertEquals(expected, 
actual) in the opposite order when checking the contents of the returned features

{code}
Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
{code}

should be

{code}
Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
{code}

* Although I pointed out the argument order of assertEquals() above, I think 
we'd better use an exact concrete string for the expected value rather than an 
expression such as testSentence[testTokenIndex]. Also, testForCorrectFeatures 
uses the contains method when checking the contents of the returned features, 
and I think we should avoid contains when checking the items in a List. The 
test currently reads like this:

{code}
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" +
    testSentence[testTokenIndex - 2]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" +
    testSentence[testTokenIndex - 1]));

Assert.assertTrue(features.contains(testSentence[testTokenIndex]));

Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" +
    testSentence[testTokenIndex + 1]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" +
    testSentence[testTokenIndex + 2]));
{code}

but I'd like to rewrite them like this:

{code}
Assert.assertEquals("d",features.get(0));
Assert.assertEquals("p1c",features.get(1));
Assert.assertEquals("p2b",features.get(2));
Assert.assertEquals("n1e",features.get(3));
Assert.assertEquals("n2f",features.get(4));
{code}

The second form helps us to understand how WindowFeatureGenerator works and 
it's easier to read.

  was:
I'd like to improve WindowFeatureGeneratorTest in the following respects:

* testWindowSizeOne should check the contents of the returned features. It 
currently checks only their length
* most of the test methods pass the arguments of Assert.assertEquals(expected, 
actual) in the opposite order when checking the contents of the returned features

{code}
Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
{code}

should be

{code}
Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
{code}

* Although I pointed out the argument order of assertEquals() above, I think 
we'd better use an exact concrete string for the expected value rather than an 
expression such as testSentence[testTokenIndex]. Also, testForCorrectFeatures 
uses the contains method when checking the contents of the returned features, 
and I think we should avoid contains when checking the items in a List. The 
test currently reads like this:

{code}
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" +
    testSentence[testTokenIndex - 2]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" +
    testSentence[testTokenIndex - 1]));

Assert.assertTrue(features.contains(testSentence[testTokenIndex]));

Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" +
    testSentence[testTokenIndex + 1]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" +
    testSentence[testTokenIndex + 2]));
{code}

but I'd like to rewrite them like this:

{code}
Assert.assertTrue("d",features.get(0));
Assert.assertTrue("p1c",features.get(1));
Assert.assertTrue("p2b",features.get(2));
Assert.assertTrue("n1e",features.get(3));
Assert.assertTrue("n2f",features.get(4));
{code}

The second form helps us to understand how WindowFeatureGenerator works and 
it's easier to read.


> Improve WindowFeatureGeneratorTest
> --
>
> Key: OPENNLP-1211
> URL: https://issues.apache.org/jira/browse/OPENNLP-1211
> Project: OpenNLP
>  Issue Type: Test
>  Components: Build, Packaging and Test
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.9.1
>
>
> I'd like to improve WindowFeatureGeneratorTest in the following respects:
> * testWindowSizeOne should check the contents of the returned features. It 
> currently checks only their length
> * most of the test methods pass the arguments of Assert.assertEquals(expected, 
> actual) in the opposite order when checking the contents of the returned 
> features
> {code}
> Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
> {code}
> should be
> {code}
> 

[jira] [Created] (OPENNLP-1211) Improve WindowFeatureGeneratorTest

2018-08-10 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1211:
---

 Summary: Improve WindowFeatureGeneratorTest
 Key: OPENNLP-1211
 URL: https://issues.apache.org/jira/browse/OPENNLP-1211
 Project: OpenNLP
  Issue Type: Test
  Components: Build, Packaging and Test
Affects Versions: 1.9.0
Reporter: Koji Sekiguchi
 Fix For: 1.9.1


I'd like to improve WindowFeatureGeneratorTest in the following respects:

* testWindowSizeOne should check the contents of the returned features. It 
currently checks only their length
* most of the test methods pass the arguments of Assert.assertEquals(expected, 
actual) in the opposite order when checking the contents of the returned features

{code}
Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
{code}

should be

{code}
Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
{code}

* Although I pointed out the argument order of assertEquals() above, I think 
we'd better use an exact concrete string for the expected value rather than an 
expression such as testSentence[testTokenIndex]. Also, testForCorrectFeatures 
uses the contains method when checking the contents of the returned features, 
and I think we should avoid contains when checking the items in a List. The 
test currently reads like this:

{code}
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" +
    testSentence[testTokenIndex - 2]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" +
    testSentence[testTokenIndex - 1]));

Assert.assertTrue(features.contains(testSentence[testTokenIndex]));

Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" +
    testSentence[testTokenIndex + 1]));
Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" +
    testSentence[testTokenIndex + 2]));
{code}

but I'd like to rewrite them like this:

{code}
Assert.assertTrue("d",features.get(0));
Assert.assertTrue("p1c",features.get(1));
Assert.assertTrue("p2b",features.get(2));
Assert.assertTrue("n1e",features.get(3));
Assert.assertTrue("n2f",features.get(4));
{code}

The second form helps us to understand how WindowFeatureGenerator works and 
it's easier to read.





[jira] [Resolved] (OPENNLP-1210) Outdated documentation on -lang argument?

2018-07-31 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1210.
-
   Resolution: Fixed
 Assignee: Koji Sekiguchi
Fix Version/s: 1.9.1

Thanks, Xiang Ji! :)

> Outdated documentation on -lang argument?
> -
>
> Key: OPENNLP-1210
> URL: https://issues.apache.org/jira/browse/OPENNLP-1210
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Xiang Ji
>Assignee: Koji Sekiguchi
>Priority: Major
> Fix For: 1.9.1
>
>
> I encountered an "Unsupported language: en" error when I was trying to run the 
> `TokenNameFinderConverter` or the `TokenNameFinderTrainer`.
>  
> I'm not sure I understood the bug correctly, but after 2 hours of trying I 
> found out that, apparently in some version after `1.5.3`, OpenNLP changed the 
> language codes from two characters to three characters, i.e. one should pass 
> in `eng` instead of `en`. However, the documentation was never updated for 
> this, and no meaningful error message was given (i.e. the program didn't 
> list the supported languages).
>  
>  
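
A more helpful error message of the kind the reporter asks for could look roughly like the sketch below. This is not how OpenNLP validates the -lang argument; LangCodeCheckSketch, describe and the supported-language set are assumptions made for illustration. The only real API used is java.util.Locale, which maps a two-letter code such as "en" to its three-letter form "eng".

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.Set;

public class LangCodeCheckSketch {

  // Returns "ok: <code>" for a supported code; otherwise a message that lists
  // the supported codes and, when possible, suggests the three-letter form.
  public static String describe(String code, Set<String> supported) {
    if (supported.contains(code)) {
      return "ok: " + code;
    }
    StringBuilder msg = new StringBuilder("Unsupported language: ").append(code)
        .append(". Supported languages: ").append(supported).append('.');
    try {
      String iso3 = Locale.forLanguageTag(code).getISO3Language();
      if (supported.contains(iso3)) {
        msg.append(" Did you mean '").append(iso3).append("'?");
      }
    } catch (MissingResourceException e) {
      // No three-letter code is known for this input, so no suggestion is added.
    }
    return msg.toString();
  }

  public static void main(String[] args) {
    // Hypothetical set of supported codes.
    Set<String> supported = new LinkedHashSet<>(Arrays.asList("eng", "deu"));
    System.out.println(describe("en", supported));
    System.out.println(describe("eng", supported));
  }
}
```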





[jira] [Resolved] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory

2018-07-10 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1206.
-
Resolution: Fixed

> add TrigramNameFeatureGeneratorFactory
> --
>
> Key: OPENNLP-1206
> URL: https://issues.apache.org/jira/browse/OPENNLP-1206
> Project: OpenNLP
>  Issue Type: Task
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> Surprisingly, it's missing. :) I noticed it when I tried to use it in my 
> feature generator XML.





[jira] [Created] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory

2018-06-30 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1206:
---

 Summary: add TrigramNameFeatureGeneratorFactory
 Key: OPENNLP-1206
 URL: https://issues.apache.org/jira/browse/OPENNLP-1206
 Project: OpenNLP
  Issue Type: Task
  Components: Machine Learning
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi


Surprisingly, it's missing. :) I noticed it when I tried to use it in my 
feature generator XML.





[jira] [Closed] (OPENNLP-1205) use new XML format of feature generator in OntoNotes4NameFinderEval

2018-06-27 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi closed OPENNLP-1205.
---
Resolution: Invalid

I'm sorry, I was looking at the 1.8 source. This has already been done. Closing as invalid.

> use new XML format of feature generator in OntoNotes4NameFinderEval
> ---
>
> Key: OPENNLP-1205
> URL: https://issues.apache.org/jira/browse/OPENNLP-1205
> Project: OpenNLP
>  Issue Type: Task
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>






[jira] [Created] (OPENNLP-1205) use new XML format of feature generator in OntoNotes4NameFinderEval

2018-06-27 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1205:
---

 Summary: use new XML format of feature generator in 
OntoNotes4NameFinderEval
 Key: OPENNLP-1205
 URL: https://issues.apache.org/jira/browse/OPENNLP-1205
 Project: OpenNLP
  Issue Type: Task
  Components: Build, Packaging and Test
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi








[jira] [Resolved] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

2018-06-27 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1197.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
> --
>
> Key: OPENNLP-1197
> URL: https://issues.apache.org/jira/browse/OPENNLP-1197
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Major
> Fix For: 1.9.0
>
>
> FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
> (lower case). This looks like a bug to me because they're not lower case 
> letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
> only takes care of European/American languages.
> For example, in Japanese NER, typical token classes are as follows:
> - DIGIT
> - HIRA : あ, い, う, え, お etc.
> - KATA : ア, イ, ウ, エ, オ etc.
> - ALPHA : we don't need to distinguish lower/upper case
> - OTHER
> I think we could make FeatureGeneratorUtil.tokenFeature() support the 
> additional token classes mentioned above, but later on, users of other Asian 
> languages may ask for something similar.
> I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
> idea yet.





[jira] [Reopened] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

2018-06-26 Thread Koji Sekiguchi (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reopened OPENNLP-1197:
-

After applying this patch, the Eval tests, which are not run by mvn test, no 
longer pass. I'm reopening this to investigate.

> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
> --
>
> Key: OPENNLP-1197
> URL: https://issues.apache.org/jira/browse/OPENNLP-1197
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Major
>
> FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
> (lower case). This looks like a bug to me because they're not lower case 
> letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
> only takes care of European/American languages.
> For example, in Japanese NER, typical token classes are as follows:
> - DIGIT
> - HIRA : あ, い, う, え, お etc.
> - KATA : ア, イ, ウ, エ, オ etc.
> - ALPHA : we don't need to distinguish lower/upper case
> - OTHER
> I think we could make FeatureGeneratorUtil.tokenFeature() support the 
> additional token classes mentioned above, but later on, users of other Asian 
> languages may ask for something similar.
> I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
> idea yet.





[jira] [Created] (OPENNLP-1201) add bailout way for certain languages in order to use POS features

2018-06-04 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1201:
---

 Summary: add bailout way for certain languages in order to use POS 
features
 Key: OPENNLP-1201
 URL: https://issues.apache.org/jira/browse/OPENNLP-1201
 Project: OpenNLP
  Issue Type: Improvement
  Components: Command Line Interface, Formats
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi


Because OpenNLP tools require the text being processed to be tokenized in 
advance (in other words, the words in the text must be separated from each 
other by spaces), it is difficult for users of certain languages (e.g. CJK) 
to use POS (Part-of-Speech) features.

To simplify the explanation, consider using NameFinder for Japanese text. The 
NameFinder tools (Train, Eval, Recognize) require users to provide Japanese 
text which has already been tokenized, but once we tokenize Japanese text, it 
loses its POS information. (I think Chinese has the same problem.)

Let me describe this problem for western language users :) (English, French, 
Italian, etc.) without using Japanese letters. I’ll use the English alphabet 
instead.

Suppose you have the sentence text “isentthemachine” which you want to give to 
NameFinder, and you use a morphological analyzer to tokenize it. There are two 
possible sequences of tokens:

- i (PPSS) / sent (VBD) / the (AT) / machine (NP)

- i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)

As you may have noticed, the morphological analyzer not only tokenizes the 
sentence but also assigns a POS tag to each token. The same thing happens in 
Japanese (and in Chinese, I think).

However, the OpenNLP feature generator API accepts only a sequence of tokens, 
i.e. `String[] tokens`, so I cannot produce POS features in a feature 
generator.

To solve this problem (and to invite many users to our community), I’d like to 
suggest that OpenNLP tools allow users to add optional information to each 
tokenized word.

For example, one can give the following text when using NameFinder tools.

{code}
$ cat en-ner.train
I/PPSS sent/VBD the/AT machine/NP
{code}

When using such text, users must tell the tool that each token in the text 
carries a POS tag by using an option, e.g. -postag:

{code}
$ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
{code}

We can maintain backward compatibility by making -postag false by default; in 
this case, existing feature generators work exactly as before. If a user sets 
the -postag option on the command line, the existing feature generators strip 
the “/POS” part of each “word/POS” token in the text so that they produce the 
same features as before.

In addition to handling the -postag option, I’d like to add a simple feature 
generator which generates only the “POS” part of each “word/POS” token in the 
text. This simple feature generator would allow Japanese/Chinese users to 
produce precise POS features.

I’d like to focus on NameFinder in this ticket (Let me add this option to other 
tools (chunker, classifier, etc.) in another ticket, if needed).
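
The “word/POS” convention described above could be parsed roughly as in this sketch. It is only an illustration of the idea, not a proposed patch; PosTaggedLineSketch and its methods are hypothetical names, and splitting at the last '/' is an assumption about how words that themselves contain a slash should be handled.

```java
import java.util.Arrays;

public class PosTaggedLineSketch {

  // Splits "word/POS" at the last '/', so a word that itself contains '/'
  // keeps it. Returns {word, tag}; tag is null when no usable tag is present.
  public static String[] splitToken(String token) {
    int slash = token.lastIndexOf('/');
    if (slash <= 0 || slash == token.length() - 1) {
      return new String[] {token, null};
    }
    return new String[] {token.substring(0, slash), token.substring(slash + 1)};
  }

  // The word parts only: what the existing feature generators would see.
  public static String[] words(String line) {
    return Arrays.stream(line.trim().split("\\s+"))
        .map(t -> splitToken(t)[0])
        .toArray(String[]::new);
  }

  // The POS parts only: what a POS-only feature generator would see.
  public static String[] tags(String line) {
    return Arrays.stream(line.trim().split("\\s+"))
        .map(t -> splitToken(t)[1])
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    String line = "I/PPSS sent/VBD the/AT machine/NP";
    System.out.println(Arrays.toString(words(line))); // [I, sent, the, machine]
    System.out.println(Arrays.toString(tags(line)));  // [PPSS, VBD, AT, NP]
  }
}
```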





[jira] [Resolved] (OPENNLP-1199) Correct Loop Bounds for NgramGenerator.generate function

2018-05-22 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1199.
-
Resolution: Fixed

> Correct Loop Bounds for NgramGenerator.generate function
> 
>
> Key: OPENNLP-1199
> URL: https://issues.apache.org/jira/browse/OPENNLP-1199
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Prachi Prakash
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: pull-request-available
>
> A small enhancement to the loop condition of NGramGenerator.generate function 
> which saves a subsequent if condition check. I have also attached the PR link
> [Pull Request|https://github.com/apache/opennlp/pull/318]





[jira] [Assigned] (OPENNLP-1198) add more tests to NGramGeneratorTest

2018-05-22 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1198:
---

Assignee: Koji Sekiguchi

> add more tests to NGramGeneratorTest
> 
>
> Key: OPENNLP-1198
> URL: https://issues.apache.org/jira/browse/OPENNLP-1198
> Project: OpenNLP
>  Issue Type: Test
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> At present, NGramGeneratorTest has only a 2-gram test against the example 
> sentence "This is a sentence". I think we'd better have 1-gram, 3-gram and 
> 4-gram test cases for this example sentence as well.
> In addition, it checks the return values like this:
> {code}
> Assert.assertEquals(3, ngrams.size());
> Assert.assertTrue(ngrams.contains("This-is"));
> Assert.assertTrue(ngrams.contains("is-a"));
> Assert.assertTrue(ngrams.contains("a-sentence"));
> {code}
> but this cannot check the order. I think we should check it like this 
> instead:
> {code}
> Assert.assertEquals(3, ngrams.size());
> Assert.assertEquals("This-is", ngrams.get(0));
> Assert.assertEquals("is-a", ngrams.get(1));
> Assert.assertEquals("a-sentence", ngrams.get(2));
> {code}





[jira] [Resolved] (OPENNLP-1198) add more tests to NGramGeneratorTest

2018-05-22 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1198.
-
Resolution: Fixed

> add more tests to NGramGeneratorTest
> 
>
> Key: OPENNLP-1198
> URL: https://issues.apache.org/jira/browse/OPENNLP-1198
> Project: OpenNLP
>  Issue Type: Test
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> At present, NGramGeneratorTest has only a 2-gram test against the example 
> sentence "This is a sentence". I think we'd better have 1-gram, 3-gram and 
> 4-gram test cases for this example sentence as well.
> In addition, it checks the return values like this:
> {code}
> Assert.assertEquals(3, ngrams.size());
> Assert.assertTrue(ngrams.contains("This-is"));
> Assert.assertTrue(ngrams.contains("is-a"));
> Assert.assertTrue(ngrams.contains("a-sentence"));
> {code}
> but this cannot check the order. I think we should check it like this 
> instead:
> {code}
> Assert.assertEquals(3, ngrams.size());
> Assert.assertEquals("This-is", ngrams.get(0));
> Assert.assertEquals("is-a", ngrams.get(1));
> Assert.assertEquals("a-sentence", ngrams.get(2));
> {code}





[jira] [Created] (OPENNLP-1198) add more tests to NGramGeneratorTest

2018-05-20 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1198:
---

 Summary: add more tests to NGramGeneratorTest
 Key: OPENNLP-1198
 URL: https://issues.apache.org/jira/browse/OPENNLP-1198
 Project: OpenNLP
  Issue Type: Test
  Components: Build, Packaging and Test
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi


At present, NGramGeneratorTest has only a 2-gram test against the example 
sentence "This is a sentence". I think we'd better have 1-gram, 3-gram and 
4-gram test cases for this example sentence as well.

In addition, it checks the return values like this:

{code}
Assert.assertEquals(3, ngrams.size());
Assert.assertTrue(ngrams.contains("This-is"));
Assert.assertTrue(ngrams.contains("is-a"));
Assert.assertTrue(ngrams.contains("a-sentence"));
{code}

but this cannot check the order. I think we should check it like this instead:

{code}
Assert.assertEquals(3,  ngrams.size());
Assert.assertEquals("This-is", ngrams.get(0));
Assert.assertEquals("is-a", ngrams.get(1));
Assert.assertEquals("a-sentence", ngrams.get(2));
{code}
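
For the additional 1-gram, 3-gram and 4-gram cases, the expected values can be worked out with a small stand-alone sketch like the one below. It is a re-implementation written for this illustration, not the OpenNLP NGramGenerator; the class NGramSketch and its generate method are assumptions, and only the "-" separator and the example sentence come from the test above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramSketch {

  // Joins every run of n consecutive tokens with the separator, left to right.
  // Returns an empty list when the input has fewer than n tokens.
  public static List<String> generate(List<String> tokens, int n, String separator) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be at least 1");
    }
    List<String> ngrams = new ArrayList<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      ngrams.add(String.join(separator, tokens.subList(i, i + n)));
    }
    return ngrams;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("This", "is", "a", "sentence");
    System.out.println(generate(tokens, 1, "-")); // [This, is, a, sentence]
    System.out.println(generate(tokens, 2, "-")); // [This-is, is-a, a-sentence]
    System.out.println(generate(tokens, 3, "-")); // [This-is-a, is-a-sentence]
    System.out.println(generate(tokens, 4, "-")); // [This-is-a-sentence]
  }
}
```

Because the result is a List, comparing it against an ordered expected list checks both the items and their sequence in one assertion.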






[jira] [Updated] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

2018-05-16 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1197:

Description: 
FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
(lower case). This looks like a bug to me because they're not lower case 
letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
only takes care of European/American languages.

For example, in Japanese NER, typical token classes are as follows:

- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't need to distinguish lower/upper case
- OTHER

I think we could make FeatureGeneratorUtil.tokenFeature() support the 
additional token classes mentioned above, but later on, users of other Asian 
languages may ask for something similar.

I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
idea yet.

  was:
FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
(lower case). This looks like a bug to me because they're not lower case 
letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
only takes care of European/American languages.

For example, in Japanese NER, typical token classes are as follows:

- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't distinguish lower/upper case
- OTHER

I think we could make FeatureGeneratorUtil.tokenFeature() support the 
additional token classes mentioned above, but later on, users of other Asian 
languages may ask for something similar.

I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
idea yet.


> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
> --
>
> Key: OPENNLP-1197
> URL: https://issues.apache.org/jira/browse/OPENNLP-1197
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Priority: Major
>
> FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
> (lower case). This looks like a bug to me because they're not lower case 
> letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
> only takes care of European/American languages.
> For example, in Japanese NER, typical token classes are as follows:
> - DIGIT
> - HIRA : あ, い, う, え, お etc.
> - KATA : ア, イ, ウ, エ, オ etc.
> - ALPHA : we don't need to distinguish lower/upper case
> - OTHER
> I think we could make FeatureGeneratorUtil.tokenFeature() support the 
> additional token classes mentioned above, but later on, users of other Asian 
> languages may ask for something similar.
> I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
> idea yet.





[jira] [Created] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

2018-05-15 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1197:
---

 Summary: FeatureGeneratorUtil.tokenFeature() always returns "lc" 
for Japanese words
 Key: OPENNLP-1197
 URL: https://issues.apache.org/jira/browse/OPENNLP-1197
 Project: OpenNLP
  Issue Type: Bug
  Components: Machine Learning
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi


FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
(lower case). This looks like a bug to me because they're not lower case 
letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() 
only takes care of European/American languages.

For example, in Japanese NER, typical token classes are as follows:

- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't distinguish lower/upper case
- OTHER

I think we could make FeatureGeneratorUtil.tokenFeature() support the 
additional token classes mentioned above, but later on, users of other Asian 
languages may ask for something similar.

I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
idea yet.
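
As a starting point for discussion, the token classes listed above could be computed along these lines. This is a rough sketch and not FeatureGeneratorUtil code; JaTokenClassSketch and tokenClass are hypothetical names, and the rule that a token belongs to a class only when all of its characters do is an assumption. Kanji and mixed tokens fall into OTHER here.

```java
import java.util.function.IntPredicate;

public class JaTokenClassSketch {

  // True when the token is non-empty and every code point satisfies the test.
  private static boolean all(String token, IntPredicate test) {
    return !token.isEmpty() && token.codePoints().allMatch(test);
  }

  private static boolean inBlock(int codePoint, Character.UnicodeBlock block) {
    return Character.UnicodeBlock.of(codePoint) == block;
  }

  private static boolean isAsciiLetter(int cp) {
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
  }

  // Returns one of DIGIT, HIRA, KATA, ALPHA or OTHER.
  public static String tokenClass(String token) {
    if (all(token, Character::isDigit)) {
      return "DIGIT";
    }
    if (all(token, cp -> inBlock(cp, Character.UnicodeBlock.HIRAGANA))) {
      return "HIRA";
    }
    if (all(token, cp -> inBlock(cp, Character.UnicodeBlock.KATAKANA))) {
      return "KATA";
    }
    if (all(token, JaTokenClassSketch::isAsciiLetter)) {
      return "ALPHA"; // lower and upper case are deliberately not distinguished
    }
    return "OTHER";
  }

  public static void main(String[] args) {
    // Unicode escapes keep this source file ASCII-only.
    System.out.println(tokenClass("2018"));               // DIGIT
    System.out.println(tokenClass("\u3042\u3044\u3046")); // HIRA (hiragana a, i, u)
    System.out.println(tokenClass("\u30A2\u30A4\u30A6")); // KATA (katakana a, i, u)
    System.out.println(tokenClass("Tokyo"));              // ALPHA
    System.out.println(tokenClass("\u6771\u4EAC"));       // OTHER (two kanji)
  }
}
```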





[jira] [Resolved] (OPENNLP-1195) use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer and NaiveBayesTrainer

2018-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1195.
-
Resolution: Fixed

> use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer 
> and NaiveBayesTrainer
> 
>
> Key: OPENNLP-1195
> URL: https://issues.apache.org/jira/browse/OPENNLP-1195
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() 
> method and they are identical.
> Why don't we move it to their parent class?
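
For readers unfamiliar with the operation being consolidated here, an argmax over a double[] can be sketched as below. This is a stand-alone illustration and not the code of ArrayMath.argmax() or of the private maxIndex() methods; the class name, the tie-breaking rule (first maximum wins) and the handling of empty input are assumptions.

```java
public class ArgmaxSketch {

  // Returns the index of the largest value; on ties, the first such index.
  public static int argmax(double[] values) {
    if (values == null || values.length == 0) {
      throw new IllegalArgumentException("values must not be null or empty");
    }
    int best = 0;
    for (int i = 1; i < values.length; i++) {
      if (values[i] > values[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(argmax(new double[] {0.1, 0.7, 0.2})); // 1
    System.out.println(argmax(new double[] {0.5, 0.5}));      // 0
  }
}
```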





[jira] [Updated] (OPENNLP-1195) use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer and NaiveBayesTrainer

2018-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1195:

Summary: use ArrayMath.argmax() rather than private maxIndex() in 
PerceptronTrainer and NaiveBayesTrainer  (was: move maxIndex method to 
AbstractEventTrainer)

> use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer 
> and NaiveBayesTrainer
> 
>
> Key: OPENNLP-1195
> URL: https://issues.apache.org/jira/browse/OPENNLP-1195
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() 
> method and they are identical.
> Why don't we move it to their parent class?





[jira] [Assigned] (OPENNLP-1195) move maxIndex method to AbstractEventTrainer

2018-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1195:
---

Assignee: Koji Sekiguchi

> move maxIndex method to AbstractEventTrainer
> 
>
> Key: OPENNLP-1195
> URL: https://issues.apache.org/jira/browse/OPENNLP-1195
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() 
> method and they are identical.
> Why don't we move it to their parent class?





[jira] [Resolved] (OPENNLP-1196) move ArrayMath to a more general package

2018-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1196.
-
Resolution: Fixed

> move ArrayMath to a more general package
> 
>
> Key: OPENNLP-1196
> URL: https://issues.apache.org/jira/browse/OPENNLP-1196
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> In OPENNLP-1195, [~joern] mentioned this.
> {quote}
> There are more usages of argmax in the OpenNLP source code.
> I propose we create one common method and then try to only use that one.
> We could move the ArrayMath to a more general package and place a common 
> method there, or keep the existing one
> {quote}
> I want to solve this before OPENNLP-1195.





[jira] [Assigned] (OPENNLP-1196) move ArrayMath to a more general package

2018-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1196:
---

Assignee: Koji Sekiguchi

> move ArrayMath to a more general package
> 
>
> Key: OPENNLP-1196
> URL: https://issues.apache.org/jira/browse/OPENNLP-1196
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Affects Versions: 1.8.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> In OPENNLP-1195, [~joern] mentioned this.
> {quote}
> There are more usages of argmax in the OpenNLP source code.
> I propose we create one common method and then try to only use that one.
> We could move the ArrayMath to a more general package and place a common 
> method there, or keep the existing one
> {quote}
> I want to solve this before OPENNLP-1195.





[jira] [Created] (OPENNLP-1196) move ArrayMath to a more general package

2018-05-14 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1196:
---

 Summary: move ArrayMath to a more general package
 Key: OPENNLP-1196
 URL: https://issues.apache.org/jira/browse/OPENNLP-1196
 Project: OpenNLP
  Issue Type: Improvement
  Components: Machine Learning
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi


In OPENNLP-1195, [~joern] mentioned this.

{quote}
There are more usages of argmax in the OpenNLP source code.
I propose we create one common method and then try to only use that one.

We could move the ArrayMath to a more general package and place a common method 
there, or keep the existing one
{quote}

I want to solve this before OPENNLP-1195.





[jira] [Created] (OPENNLP-1195) move maxIndex method to AbstractEventTrainer

2018-05-14 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1195:
---

 Summary: move maxIndex method to AbstractEventTrainer
 Key: OPENNLP-1195
 URL: https://issues.apache.org/jira/browse/OPENNLP-1195
 Project: OpenNLP
  Issue Type: Improvement
Affects Versions: 1.8.4
Reporter: Koji Sekiguchi


PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() 
method and they are identical.

Why don't we move it to their parent class?
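
For readers unfamiliar with the helper in question, a shared "index of the maximum" method might look like the sketch below. The class name ArgMaxSketch and the tie-breaking rule (first maximum wins) are assumptions for illustration; the thread does not show the bodies of the private maxIndex() methods.

```java
// Hypothetical shared helper; not the actual OpenNLP implementation.
final class ArgMaxSketch {

  // Returns the index of the largest value; on ties the first index wins.
  static int argmax(double[] values) {
    if (values == null || values.length == 0) {
      throw new IllegalArgumentException("values must not be null or empty");
    }
    int best = 0;
    for (int i = 1; i < values.length; i++) {
      if (values[i] > values[best]) {
        best = i;
      }
    }
    return best;
  }
}
```

Keeping one such method in a common place means both trainers call the same code instead of maintaining identical private copies.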





[jira] [Resolved] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config

2018-01-11 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1160.
-
Resolution: Fixed

> avoid letting users specify CachedFeatureGeneratorFactory in XML config
> ---
>
> Key: OPENNLP-1160
> URL: https://issues.apache.org/jira/browse/OPENNLP-1160
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> This is similar to OPENNLP-1159. While working on OPENNLP-1154, I came to 
> think we should do this to make the config easier to use.
> I'd like to implement this as a ticket independent of OPENNLP-1154 and 
> OPENNLP-1159 to keep the patch easy to read.
> This ticket differs somewhat from OPENNLP-1159 because users must be able to 
> control whether the framework uses CachedFeatureGeneratorFactory or not.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config

2018-01-08 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1159.
-
Resolution: Fixed

> avoid letting users specify AggregatedFeatureGeneratorFactory in XML config
> ---
>
> Key: OPENNLP-1159
> URL: https://issues.apache.org/jira/browse/OPENNLP-1159
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> While working on OPENNLP-1154, I came to think we should do this to make the 
> config easier to use.
> I'd like to implement this as a ticket independent of OPENNLP-1154 to keep 
> the patch easy to read.





[jira] [Commented] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config

2017-12-29 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306169#comment-16306169
 ] 

Koji Sekiguchi commented on OPENNLP-1160:
-

I suggest adding a `cache` attribute to the top-level tag:

{code:xml}
...
{code}


> avoid letting users specify CachedFeatureGeneratorFactory in XML config
> ---
>
> Key: OPENNLP-1160
> URL: https://issues.apache.org/jira/browse/OPENNLP-1160
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> This is similar to OPENNLP-1159. While working on OPENNLP-1154, I came to 
> think we should do this to make the config easier to use.
> I'd like to implement this as a ticket independent of OPENNLP-1154 and 
> OPENNLP-1159 to keep the patch easy to read.
> This ticket differs somewhat from OPENNLP-1159 because users must be able to 
> control whether the framework uses CachedFeatureGeneratorFactory or not.





[jira] [Created] (OPENNLP-1175) explain the new format of feature generator XML config

2017-12-29 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1175:
---

 Summary: explain the new format of feature generator XML config
 Key: OPENNLP-1175
 URL: https://issues.apache.org/jira/browse/OPENNLP-1175
 Project: OpenNLP
  Issue Type: Bug
  Components: Documentation
Reporter: Koji Sekiguchi
Priority: Minor


The documentation should explain the new format of the feature generator XML 
config rather than the classic format.





[jira] [Resolved] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger

2017-12-26 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1154.
-
Resolution: Fixed

> change the XML format for feature generator config in NameFinder and POS 
> Tagger
> ---
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If users want to implement their own feature generator, they can use 
> <custom .../>, but to use two or more feature generators at once they would 
> have to provide a wrapper feature generator that wraps all of them, which is 
> not a good solution.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm also considering another format:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Resolved] (OPENNLP-1171) some tests create temp files and directories but never delete them

2017-12-19 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1171.
-
Resolution: Fixed

> some tests create temp files and directories but never delete them
> --
>
> Key: OPENNLP-1171
> URL: https://issues.apache.org/jira/browse/OPENNLP-1171
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.8.4
>
>
> Some tests create temporary files and directories but never delete them, so 
> the number of leftover temporary files and directories grows every time 
> mvn clean test is run.





[jira] [Created] (OPENNLP-1171) some tests create temp files and directories but never delete them

2017-12-18 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1171:
---

 Summary: some tests create temp files and directories but never 
delete them
 Key: OPENNLP-1171
 URL: https://issues.apache.org/jira/browse/OPENNLP-1171
 Project: OpenNLP
  Issue Type: Bug
  Components: Build, Packaging and Test
Affects Versions: 1.8.3
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 1.8.4


Some tests create temporary files and directories but never delete them, so 
the number of leftover temporary files and directories grows every time 
mvn clean test is run.
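
One common way to avoid this kind of leak is to delete whatever a test created once the test has finished. The sketch below uses only java.nio.file; the class name TempDirCleanup and its method names are hypothetical, and the thread does not show how the affected tests were actually fixed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical test helper; not taken from the OpenNLP test suite.
final class TempDirCleanup {

  // Creates a temporary directory that the caller is expected to clean up.
  static Path newTempDir(String prefix) {
    try {
      return Files.createTempDirectory(prefix);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Deletes the directory and everything below it. Paths are visited in
  // reverse order so that children are removed before their parents.
  static void deleteRecursively(Path root) {
    if (!Files.exists(root)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(path -> {
        try {
          Files.delete(path);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a test class, deleteRecursively() would typically be called from the tear-down method so that cleanup runs even when an assertion fails.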





[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger

2017-12-16 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293972#comment-16293972
 ] 

Koji Sekiguchi commented on OPENNLP-1154:
-

What I did in this patch:

* Moved all static *FeatureGeneratorFactory classes out of GeneratorFactory.java 
and made them individual factory classes such as 
BrownClusterTokenFeatureGeneratorFactory.java and 
BigramNameFeatureGeneratorFactory.java, so that users can avoid specifying 
nested class names, e.g. 
opennlp.tools.util.featuregen.GeneratorFactory.BigramNameFeatureGeneratorFactory, 
in the XML config file.

* Provided an AbstractXmlFeatureGeneratorFactory class which all 
*FeatureGeneratorFactory classes must extend. It has an init() method that the 
framework calls when the XML config file is read. It helps 
*FeatureGeneratorFactory classes set their parameters when they are specified 
in the nested way, like:

{code:xml}
   
  2
  2
  
   
{code}

* *FeatureGeneratorFactory classes can read the parameters set in the XML config 
file via getter methods, e.g. getInt("parameter name") and getStr("parameter 
name"), as long as they extend AbstractXmlFeatureGeneratorFactory. 
AbstractXmlFeatureGeneratorFactory stores the parameters in a LinkedHashMap in 
its init() method. I used LinkedHashMap rather than HashMap because the order 
in which the parameters are written must be preserved: multiple sub-generators 
can be specified in a parent FeatureGeneratorFactory, although only 
AggregatedFeatureGeneratorFactory supports multiple sub-generators for now.

* The classic format is still supported for back-compat reasons. I provided 
test cases that check both the classic and the new format. The classic format 
XML files can be found under the src/test/resources folder with *_classic.xml 
file names. GeneratorFactory recognizes which format is used in its 
createGenerator() method.

* The extractArtifactSerializerMappings() method supports both the classic and 
the new format.
* 
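
The parameter handling described in the list above can be illustrated with a simplified stand-in. Everything in this sketch is hypothetical (ParamFactorySketch, WindowFactorySketch, the init(Map) signature and the parameter names); it only shows the idea of an ordered parameter map with typed getters and is not the real AbstractXmlFeatureGeneratorFactory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class: keeps configured parameters in insertion order.
abstract class ParamFactorySketch {

  // LinkedHashMap keeps entries in the order they were added.
  private final Map<String, Object> params = new LinkedHashMap<>();

  // Stand-in for the framework callback; copies entries in the iteration
  // order of the given map.
  void init(Map<String, Object> configured) {
    params.clear();
    params.putAll(configured);
  }

  int getInt(String name) {
    Object value = params.get(name);
    if (!(value instanceof Integer)) {
      throw new IllegalArgumentException("no int parameter named " + name);
    }
    return (Integer) value;
  }

  String getStr(String name) {
    Object value = params.get(name);
    if (!(value instanceof String)) {
      throw new IllegalArgumentException("no string parameter named " + name);
    }
    return (String) value;
  }

  // Parameter names in the order they were configured.
  List<String> parameterNames() {
    return new ArrayList<>(params.keySet());
  }
}

// Hypothetical concrete factory that reads its two window sizes.
final class WindowFactorySketch extends ParamFactorySketch {
  String describe() {
    return "window(prev=" + getInt("prevLength") + ", next=" + getInt("nextLength") + ")";
  }
}
```

With a plain HashMap the iteration order of parameterNames() would be unspecified, which matters as soon as a parent factory has to process several children in the order they were written.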

> change the XML format for feature generator config in NameFinder and POS 
> Tagger
> ---
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If users want to implement their own feature generator, they can use 
> <custom .../>, but to use two or more feature generators at once they would 
> have to provide a wrapper feature generator that wraps all of them, which is 
> not a good solution.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm also considering another format:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Comment Edited] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger

2017-12-16 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293972#comment-16293972
 ] 

Koji Sekiguchi edited comment on OPENNLP-1154 at 12/17/17 12:18 AM:


What I did in this patch:

* Moved all static *FeatureGeneratorFactory classes out of GeneratorFactory.java 
and made them individual factory classes such as 
BrownClusterTokenFeatureGeneratorFactory.java and 
BigramNameFeatureGeneratorFactory.java, so that users can avoid specifying 
nested class names, e.g. 
opennlp.tools.util.featuregen.GeneratorFactory.BigramNameFeatureGeneratorFactory, 
in the XML config file.

* Provided an AbstractXmlFeatureGeneratorFactory class which all 
*FeatureGeneratorFactory classes must extend. It has an init() method that the 
framework calls when the XML config file is read. It helps 
*FeatureGeneratorFactory classes set their parameters when they are specified 
in the nested way, like:

{code:xml}
   
  2
  2
  
   
{code}

* *FeatureGeneratorFactory classes can read the parameters set in the XML config 
file via getter methods, e.g. getInt("parameter name") and getStr("parameter 
name"), as long as they extend AbstractXmlFeatureGeneratorFactory. 
AbstractXmlFeatureGeneratorFactory stores the parameters in a LinkedHashMap in 
its init() method. I used LinkedHashMap rather than HashMap because the order 
in which the parameters are written must be preserved: multiple sub-generators 
can be specified in a parent FeatureGeneratorFactory, although only 
AggregatedFeatureGeneratorFactory supports multiple sub-generators for now.

* The classic format is still supported for back-compat reasons. I provided 
test cases that check both the classic and the new format. The classic format 
XML files can be found under the src/test/resources folder with *_classic.xml 
file names. GeneratorFactory recognizes which format is used in its 
createGenerator() method.

* The extractArtifactSerializerMappings() method supports both the classic and 
the new format.



was (Author: koji):
What I did in this patch:

* Moved all static *FeatureGeneratorFactory classes out of GeneratorFactory.java 
and made them individual factory classes such as 
BrownClusterTokenFeatureGeneratorFactory.java and 
BigramNameFeatureGeneratorFactory.java, so that users can avoid specifying 
nested class names, e.g. 
opennlp.tools.util.featuregen.GeneratorFactory.BigramNameFeatureGeneratorFactory, 
in the XML config file.

* Provided an AbstractXmlFeatureGeneratorFactory class which all 
*FeatureGeneratorFactory classes must extend. It has an init() method that the 
framework calls when the XML config file is read. It helps 
*FeatureGeneratorFactory classes set their parameters when they are specified 
in the nested way, like:

{code:xml}
   
  2
  2
  
   
{code}

* *FeatureGeneratorFactory classes can read the parameters set in the XML config 
file via getter methods, e.g. getInt("parameter name") and getStr("parameter 
name"), as long as they extend AbstractXmlFeatureGeneratorFactory. 
AbstractXmlFeatureGeneratorFactory stores the parameters in a LinkedHashMap in 
its init() method. I used LinkedHashMap rather than HashMap because the order 
in which the parameters are written must be preserved: multiple sub-generators 
can be specified in a parent FeatureGeneratorFactory, although only 
AggregatedFeatureGeneratorFactory supports multiple sub-generators for now.

* The classic format is still supported for back-compat reasons. I provided 
test cases that check both the classic and the new format. The classic format 
XML files can be found under the src/test/resources folder with *_classic.xml 
file names. GeneratorFactory recognizes which format is used in its 
createGenerator() method.

* The extractArtifactSerializerMappings() method supports both the classic and 
the new format.
* 

> change the XML format for feature generator config in NameFinder and POS 
> Tagger
> ---
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If users want to implement their own feature generator, they can use 
> <custom .../>, but to use two or more feature generators at once they would 
> have to provide a wrapper feature generator that wraps all of them, which is 
> not a good solution.
> I'd 

[jira] [Closed] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()

2017-12-04 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi closed OPENNLP-1161.
---
Resolution: Won't Fix

The suggested solution can be implemented in the blocked issue, OPENNLP-1154.

> avoid using concrete tag names of XML config in 
> GeneratorFactory.extractArtifactSerializerMappings()
> 
>
> Key: OPENNLP-1161
> URL: https://issues.apache.org/jira/browse/OPENNLP-1161
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Blocker
>
> When working on OPENNLP-1154, I noticed this.
> In GeneratorFactory.extractArtifactSerializerMappings(), it specifies 
> concrete XML tag names:
> {code:java}
> for (int i = 0; i < allElements.getLength(); i++) {
>   if (allElements.item(i) instanceof Element) {
> Element xmlElement = (Element) allElements.item(i);
> String dictName = xmlElement.getAttribute("dict");
> if (dictName != null) {
>   switch (xmlElement.getTagName()) {
> case "wordcluster":
>   mapping.put(dictName, new 
> WordClusterDictionary.WordClusterDictionarySerializer());
>   break;
> case "brownclustertoken":
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclustertokenclass"://, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclusterbigram": //, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "dictionary":
>   mapping.put(dictName, new DictionarySerializer());
>   break;
>   }
> }
> String modelName = xmlElement.getAttribute("model");
> if (modelName != null) {
>   switch (xmlElement.getTagName()) {
> case "tokenpos":
>   mapping.put(modelName, new POSModelSerializer());
>   break;
>   }
> }
>   }
> }
> {code}
> Instead, it would be better to let each FeatureGeneratorFactory implement a 
> method that returns the mapping (Map), so that 
> GeneratorFactory.extractArtifactSerializerMappings() only has to call that 
> method on the FeatureGeneratorFactories found in the XML config.





[jira] [Commented] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()

2017-12-04 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277921#comment-16277921
 ] 

Koji Sekiguchi commented on OPENNLP-1161:
-

I posted a patch as PR 292, but I (and one of the committers) don't like it 
because it requires the generators to return the artifact serializer mapping 
information.

As I suggested in this ticket, we should let FeatureGeneratorFactories (not 
generators) implement a method that returns the mapping (Map), so that 
GeneratorFactory.extractArtifactSerializerMappings() only has to call that 
method on the FeatureGeneratorFactories found in the XML config. However, I 
couldn't implement this because I needed to keep back-compat.

I'll withdraw this. Instead, I think I can achieve it in the blocked issue, 
OPENNLP-1154.

> avoid using concrete tag names of XML config in 
> GeneratorFactory.extractArtifactSerializerMappings()
> 
>
> Key: OPENNLP-1161
> URL: https://issues.apache.org/jira/browse/OPENNLP-1161
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Blocker
>
> When working on OPENNLP-1154, I noticed this.
> In GeneratorFactory.extractArtifactSerializerMappings(), it specifies 
> concrete XML tag names:
> {code:java}
> for (int i = 0; i < allElements.getLength(); i++) {
>   if (allElements.item(i) instanceof Element) {
> Element xmlElement = (Element) allElements.item(i);
> String dictName = xmlElement.getAttribute("dict");
> if (dictName != null) {
>   switch (xmlElement.getTagName()) {
> case "wordcluster":
>   mapping.put(dictName, new 
> WordClusterDictionary.WordClusterDictionarySerializer());
>   break;
> case "brownclustertoken":
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclustertokenclass"://, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclusterbigram": //, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "dictionary":
>   mapping.put(dictName, new DictionarySerializer());
>   break;
>   }
> }
> String modelName = xmlElement.getAttribute("model");
> if (modelName != null) {
>   switch (xmlElement.getTagName()) {
> case "tokenpos":
>   mapping.put(modelName, new POSModelSerializer());
>   break;
>   }
> }
>   }
> }
> {code}
> Instead, it would be better to let each FeatureGeneratorFactory implement a 
> method that returns the mapping (Map), so that 
> GeneratorFactory.extractArtifactSerializerMappings() only has to call that 
> method on the FeatureGeneratorFactories found in the XML config.





[jira] [Commented] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()

2017-11-30 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273875#comment-16273875
 ] 

Koji Sekiguchi commented on OPENNLP-1161:
-

This is a blocker for OPENNLP-1154 because in OPENNLP-1154 I am trying to change 
the XML format from the classic one to a new one, and the current 
implementation of GeneratorFactory.extractArtifactSerializerMappings() depends 
on the classic format.

> avoid using concrete tag names of XML config in 
> GeneratorFactory.extractArtifactSerializerMappings()
> 
>
> Key: OPENNLP-1161
> URL: https://issues.apache.org/jira/browse/OPENNLP-1161
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Blocker
>
> When working on OPENNLP-1154, I noticed this.
> In GeneratorFactory.extractArtifactSerializerMappings(), it specifies 
> concrete XML tag names:
> {code:java}
> for (int i = 0; i < allElements.getLength(); i++) {
>   if (allElements.item(i) instanceof Element) {
> Element xmlElement = (Element) allElements.item(i);
> String dictName = xmlElement.getAttribute("dict");
> if (dictName != null) {
>   switch (xmlElement.getTagName()) {
> case "wordcluster":
>   mapping.put(dictName, new 
> WordClusterDictionary.WordClusterDictionarySerializer());
>   break;
> case "brownclustertoken":
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclustertokenclass"://, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "brownclusterbigram": //, ;
>   mapping.put(dictName, new 
> BrownCluster.BrownClusterSerializer());
>   break;
> case "dictionary":
>   mapping.put(dictName, new DictionarySerializer());
>   break;
>   }
> }
> String modelName = xmlElement.getAttribute("model");
> if (modelName != null) {
>   switch (xmlElement.getTagName()) {
> case "tokenpos":
>   mapping.put(modelName, new POSModelSerializer());
>   break;
>   }
> }
>   }
> }
> {code}
> Instead, it would be better to let each FeatureGeneratorFactory implement a 
> method that returns the mapping (Map), so that 
> GeneratorFactory.extractArtifactSerializerMappings() only has to call that 
> method on the FeatureGeneratorFactories found in the XML config.





[jira] [Created] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()

2017-11-30 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1161:
---

 Summary: avoid using concrete tag names of XML config in 
GeneratorFactory.extractArtifactSerializerMappings()
 Key: OPENNLP-1161
 URL: https://issues.apache.org/jira/browse/OPENNLP-1161
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats, Name Finder
Affects Versions: 1.8.3
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Blocker


When working on OPENNLP-1154, I noticed this.

In GeneratorFactory.extractArtifactSerializerMappings(), it specifies concrete 
XML tag names:

{code:java}
for (int i = 0; i < allElements.getLength(); i++) {
  if (allElements.item(i) instanceof Element) {
    Element xmlElement = (Element) allElements.item(i);

    String dictName = xmlElement.getAttribute("dict");
    if (dictName != null) {

      switch (xmlElement.getTagName()) {
        case "wordcluster":
          mapping.put(dictName, new 
              WordClusterDictionary.WordClusterDictionarySerializer());
          break;

        case "brownclustertoken":
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;

        case "brownclustertokenclass"://, ;
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;

        case "brownclusterbigram": //, ;
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;

        case "dictionary":
          mapping.put(dictName, new DictionarySerializer());
          break;
      }
    }

    String modelName = xmlElement.getAttribute("model");
    if (modelName != null) {

      switch (xmlElement.getTagName()) {
        case "tokenpos":
          mapping.put(modelName, new POSModelSerializer());
          break;
      }
    }
  }
}
{code}

Instead, it would be better to let each FeatureGeneratorFactory implement a 
method that returns the mapping (Map), so that 
GeneratorFactory.extractArtifactSerializerMappings() only has to call that 
method on the FeatureGeneratorFactories found in the XML config.
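
The suggestion above could be sketched as follows. All type and method names here (SerializerSketch, FactorySketch, getArtifactSerializerMapping(), MappingCollector) are hypothetical stand-ins chosen for illustration; they are not the real OpenNLP interfaces, and the map's type parameters are an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an artifact serializer.
interface SerializerSketch {
  String name();
}

// Hypothetical factory contract: each factory reports the serializers it
// needs, keyed by the resource name it was configured with.
interface FactorySketch {
  Map<String, SerializerSketch> getArtifactSerializerMapping();
}

final class DictionaryFactorySketch implements FactorySketch {
  private final String dictName;

  DictionaryFactorySketch(String dictName) {
    this.dictName = dictName;
  }

  @Override
  public Map<String, SerializerSketch> getArtifactSerializerMapping() {
    Map<String, SerializerSketch> mapping = new LinkedHashMap<>();
    mapping.put(dictName, () -> "DictionarySerializer");
    return mapping;
  }
}

final class BrownClusterFactorySketch implements FactorySketch {
  private final String dictName;

  BrownClusterFactorySketch(String dictName) {
    this.dictName = dictName;
  }

  @Override
  public Map<String, SerializerSketch> getArtifactSerializerMapping() {
    Map<String, SerializerSketch> mapping = new LinkedHashMap<>();
    mapping.put(dictName, () -> "BrownClusterSerializer");
    return mapping;
  }
}

// The framework side no longer needs a switch over concrete tag names:
// it simply merges what each configured factory reports.
final class MappingCollector {
  static Map<String, SerializerSketch> collect(List<FactorySketch> factories) {
    Map<String, SerializerSketch> mapping = new LinkedHashMap<>();
    for (FactorySketch factory : factories) {
      mapping.putAll(factory.getArtifactSerializerMapping());
    }
    return mapping;
  }
}
```

With this shape, adding a new feature generator that needs its own serializer only requires a new factory class, not a change to the framework's switch statement.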





[jira] [Commented] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config

2017-11-28 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269980#comment-16269980
 ] 

Koji Sekiguchi commented on OPENNLP-1160:
-

After committing all of OPENNLP-1154, OPENNLP-1159 and OPENNLP-1160, the XML 
config looks like:

{code:xml}


  2
  2
  


  2
  2
  





  true
  false


{code}


> avoid letting users specify CachedFeatureGeneratorFactory in XML config
> ---
>
> Key: OPENNLP-1160
> URL: https://issues.apache.org/jira/browse/OPENNLP-1160
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> This is similar to OPENNLP-1159. While working on OPENNLP-1154, I realized we 
> should do this to make the config easier to use.
> I'd like to implement it as a ticket independent of OPENNLP-1154 and 
> OPENNLP-1159 to keep the patch easy to read.
> This ticket differs somewhat from OPENNLP-1159, as users must be able to 
> control whether the framework uses CachedFeatureGeneratorFactory.





[jira] [Commented] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config

2017-11-28 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269969#comment-16269969
 ] 

Koji Sekiguchi commented on OPENNLP-1159:
-

After committing OPENNLP-1154, the XML config looks like:

{code:xml}

  

  

  2
  2
  


  2
  2
  





  true
  false

  

  

{code}

Then after committing this ticket, the XML config looks like:

{code:xml}



  2
  2
  


  2
  2
  





  true
  false



{code}

Users should likewise not have to specify CachedFeatureGeneratorFactory 
explicitly, but I prefer to implement that in OPENNLP-1160.

> avoid letting users specify AggregatedFeatureGeneratorFactory in XML config
> ---
>
> Key: OPENNLP-1159
> URL: https://issues.apache.org/jira/browse/OPENNLP-1159
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats, Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> While working on OPENNLP-1154, I realized we should do this to make the 
> config easier to use.
> I'd like to implement it as a ticket independent of OPENNLP-1154 to keep the 
> patch easy to read.





[jira] [Created] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config

2017-11-28 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1160:
---

 Summary: avoid letting users specify CachedFeatureGeneratorFactory 
in XML config
 Key: OPENNLP-1160
 URL: https://issues.apache.org/jira/browse/OPENNLP-1160
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats, Name Finder
Affects Versions: 1.8.3
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor


This is similar to OPENNLP-1159. While working on OPENNLP-1154, I realized we 
should do this to make the config easier to use.

I'd like to implement it as a ticket independent of OPENNLP-1154 and 
OPENNLP-1159 to keep the patch easy to read.

This ticket differs somewhat from OPENNLP-1159, as users must be able to 
control whether the framework uses CachedFeatureGeneratorFactory.





[jira] [Created] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config

2017-11-28 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1159:
---

 Summary: avoid letting users specify 
AggregatedFeatureGeneratorFactory in XML config
 Key: OPENNLP-1159
 URL: https://issues.apache.org/jira/browse/OPENNLP-1159
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats, Name Finder
Affects Versions: 1.8.3
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor


While working on OPENNLP-1154, I realized we should do this to make the config 
easier to use.

I'd like to implement it as a ticket independent of OPENNLP-1154 to keep the 
patch easy to read.





[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger

2017-11-17 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256750#comment-16256750
 ] 

Koji Sekiguchi commented on OPENNLP-1154:
-

As Joern suggested, this should apply not only to NameFinder but also to POS 
Tagger, so I added "POS Tagger" to the title.

> change the XML format for feature generator config in NameFinder and POS 
> Tagger
> ---
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If a user wants to implement their own feature generator, he can use  .../>, but if he wants to have two or more feature generators at once, he may 
> be able to implement it by providing a wrapper feature generator which wraps 
> two or more feature generators that he originally wants to have, but it is 
> not good.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm thinking another format as well:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Updated] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger

2017-11-17 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated OPENNLP-1154:

Summary: change the XML format for feature generator config in NameFinder 
and POS Tagger  (was: change the XML format for feature generator config in 
NameFinder)

> change the XML format for feature generator config in NameFinder and POS 
> Tagger
> ---
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If a user wants to implement their own feature generator, he can use  .../>, but if he wants to have two or more feature generators at once, he may 
> be able to implement it by providing a wrapper feature generator which wraps 
> two or more feature generators that he originally wants to have, but it is 
> not good.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm thinking another format as well:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Comment Edited] (OPENNLP-1154) change the XML format for feature generator config in NameFinder

2017-11-10 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247381#comment-16247381
 ] 

Koji Sekiguchi edited comment on OPENNLP-1154 at 11/10/17 11:58 AM:


I'll post the first patch soon. It still fails a few tests because I haven't 
yet handled serialization/deserialization for the new format and some other 
details. I'm posting the first patch before implementing the rest 
(serialization/deserialization, test cases, etc.) because I'd like to know the 
committers' thoughts on the new format. I also think we can support the 
"classic" format for back-compat reasons, if needed. The first patch does so, 
but that brings in many Deprecated annotations. I'd like to know your thoughts 
on back-compat support as well.

I still don't understand the versioning system in OpenNLP. If we introduce this 
new format in 1.9, do I still need to consider the "classic" format?


was (Author: koji):
I'll post the first patch soon. It still fails one test because I haven't yet 
handled serialization/deserialization for the new format. I'm posting the first 
patch before implementing the rest (serialization/deserialization, test cases, 
etc.) because I'd like to know the committers' thoughts on the new format. I 
also think we can support the "classic" format for back-compat reasons, if 
needed. The first patch does so, but that brings in many Deprecated 
annotations. I'd like to know your thoughts on back-compat support as well.

I still don't understand the versioning system in OpenNLP. If we introduce this 
new format in 1.9, do I still need to consider the "classic" format?

> change the XML format for feature generator config in NameFinder
> 
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If a user wants to implement their own feature generator, he can use  .../>, but if he wants to have two or more feature generators at once, he may 
> be able to implement it by providing a wrapper feature generator which wraps 
> two or more feature generators that he originally wants to have, but it is 
> not good.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm thinking another format as well:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder

2017-11-10 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247381#comment-16247381
 ] 

Koji Sekiguchi commented on OPENNLP-1154:
-

I'll post the first patch soon. It still fails one test because I haven't yet 
handled serialization/deserialization for the new format. I'm posting the first 
patch before implementing the rest (serialization/deserialization, test cases, 
etc.) because I'd like to know the committers' thoughts on the new format. I 
also think we can support the "classic" format for back-compat reasons, if 
needed. The first patch does so, but that brings in many Deprecated 
annotations. I'd like to know your thoughts on back-compat support as well.

I still don't understand the versioning system in OpenNLP. If we introduce this 
new format in 1.9, do I still need to consider the "classic" format?

> change the XML format for feature generator config in NameFinder
> 
>
> Key: OPENNLP-1154
> URL: https://issues.apache.org/jira/browse/OPENNLP-1154
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>
> NameFinder provides many kinds of feature generator (factories). Users can 
> define their config via XML which looks like:
> {code:xml}
> 
>
> 
> 
> 
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
> 
> {code}
> If a user wants to implement their own feature generator, he can use  .../>, but if he wants to have two or more feature generators at once, he may 
> be able to implement it by providing a wrapper feature generator which wraps 
> two or more feature generators that he originally wants to have, but it is 
> not good.
> I'd like to suggest that we make the config format more flexible like below:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>   
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
> 
>  class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
>   
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
>   
> 
>   
> 
> {code}
> If ... is too noisy, I'm thinking another format as well:
> {code:xml}
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory">
>  class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory">
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/>
>   
>class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory">
> 2
> 2
>  class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/>
>   
> 
>   
> 
> {code}





[jira] [Created] (OPENNLP-1154) change the XML format for feature generator config in NameFinder

2017-11-10 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1154:
---

 Summary: change the XML format for feature generator config in 
NameFinder
 Key: OPENNLP-1154
 URL: https://issues.apache.org/jira/browse/OPENNLP-1154
 Project: OpenNLP
  Issue Type: Improvement
  Components: Name Finder
Affects Versions: 1.8.3
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi


NameFinder provides many kinds of feature generator (factories). Users can 
define their config via XML which looks like:

{code:xml}

   



  
  

  
  
  
  
  

   

{code}

If a user wants to implement their own feature generator, he can use , but if he wants to have two or more feature generators at once, he may 
be able to implement it by providing a wrapper feature generator which wraps 
two or more feature generators that he originally wants to have, but it is not 
good.

I'd like to suggest that we make the config format more flexible like below:

{code:xml}

  

  

  

  
2
2

  


  
2
2

  

  

  

  

{code}

If ... is too noisy, I'm thinking another format as well:

{code:xml}

  

  
2
2

  
  
2
2

  

  

{code}







[jira] [Assigned] (OPENNLP-1149) remove unused member in PlainTextByLineStream

2017-10-24 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1149:
---

Assignee: Koji Sekiguchi

> remove unused member in PlainTextByLineStream
> -
>
> Key: OPENNLP-1149
> URL: https://issues.apache.org/jira/browse/OPENNLP-1149
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> PlainTextByLineStream has a private member variable "channel" but it is never 
> set and hence, it is always null. It can be removed to simplify code.





[jira] [Resolved] (OPENNLP-1149) remove unused member in PlainTextByLineStream

2017-10-24 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1149.
-
Resolution: Fixed

Thanks everyone for reviewing this! :)

> remove unused member in PlainTextByLineStream
> -
>
> Key: OPENNLP-1149
> URL: https://issues.apache.org/jira/browse/OPENNLP-1149
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> PlainTextByLineStream has a private member variable "channel" but it is never 
> set and hence, it is always null. It can be removed to simplify code.





[jira] [Resolved] (OPENNLP-1145) Javadoc of NaiveBayesTrainer class looks incorrect

2017-10-23 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1145.
-
Resolution: Fixed
  Assignee: Koji Sekiguchi

Thanks everyone for reviewing this! :)

> Javadoc of NaiveBayesTrainer class looks incorrect
> --
>
> Key: OPENNLP-1145
> URL: https://issues.apache.org/jira/browse/OPENNLP-1145
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> It seems that Javadoc of NaiveBayesTrainer class was copied from 
> PerceptronTrainer and hence, it says "Trains models using the perceptron 
> algorithm." :)





[jira] [Created] (OPENNLP-1150) TokenNameFinderTrainerTool should use ModelUtil.createDefaultTrainingParameters() when mlParams is null

2017-10-23 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1150:
---

 Summary: TokenNameFinderTrainerTool should use 
ModelUtil.createDefaultTrainingParameters() when mlParams is null
 Key: OPENNLP-1150
 URL: https://issues.apache.org/jira/browse/OPENNLP-1150
 Project: OpenNLP
  Issue Type: Improvement
  Components: Name Finder
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.8.3


Unlike the other TrainerTools, TokenNameFinderTrainerTool creates an empty 
TrainingParameters by calling the constructor when mlParams is null. 
TokenNameFinderTrainerTool should instead initialize mlParams with 
ModelUtil.createDefaultTrainingParameters(), as the other TrainerTools do.
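
A minimal sketch of the intended fallback, assuming hypothetical stand-ins: ParamsSketch replaces TrainingParameters, and the local createDefaultTrainingParameters() only imitates ModelUtil.createDefaultTrainingParameters(); the default values shown are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TrainingParameters; only what the sketch needs.
class ParamsSketch {
  final Map<String, String> settings = new HashMap<>();
}

class TrainerToolSketch {
  // Stand-in for ModelUtil.createDefaultTrainingParameters(); values are illustrative.
  static ParamsSketch createDefaultTrainingParameters() {
    ParamsSketch params = new ParamsSketch();
    params.settings.put("Iterations", "100");
    params.settings.put("Cutoff", "5");
    return params;
  }

  // Fall back to the shared defaults instead of an empty parameter object.
  static ParamsSketch resolve(ParamsSketch mlParams) {
    return mlParams != null ? mlParams : createDefaultTrainingParameters();
  }
}
```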





[jira] [Commented] (OPENNLP-1149) remove unused member in PlainTextByLineStream

2017-10-22 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214646#comment-16214646
 ] 

Koji Sekiguchi commented on OPENNLP-1149:
-

I'll change the type of private member "encoding" from String to Charset in 
this patch.
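
A rough sketch of what holding a Charset instead of a String could look like. LineReaderSketch is a hypothetical class written for illustration, not PlainTextByLineStream itself:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical line reader; illustrates the member type only.
class LineReaderSketch {
  // A Charset rather than a charset name, so no name lookup can fail later.
  private final Charset encoding;
  private final BufferedReader in;

  LineReaderSketch(byte[] data, Charset encoding) {
    this.encoding = encoding;
    this.in = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(data), this.encoding));
  }

  // Returns the next line, or null at the end of the input.
  String read() {
    try {
      return in.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```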

> remove unused member in PlainTextByLineStream
> -
>
> Key: OPENNLP-1149
> URL: https://issues.apache.org/jira/browse/OPENNLP-1149
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> PlainTextByLineStream has a private member variable "channel" but it is never 
> set and hence, it is always null. It can be removed to simplify code.





[jira] [Created] (OPENNLP-1149) remove unused member in PlainTextByLineStream

2017-10-22 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1149:
---

 Summary: remove unused member in PlainTextByLineStream
 Key: OPENNLP-1149
 URL: https://issues.apache.org/jira/browse/OPENNLP-1149
 Project: OpenNLP
  Issue Type: Improvement
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.8.3


PlainTextByLineStream has a private member variable "channel" but it is never 
set and hence, it is always null. It can be removed to simplify code.





[jira] [Created] (OPENNLP-1148) use StandardCharsets.UTF_8 in doc

2017-10-20 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1148:
---

 Summary: use StandardCharsets.UTF_8 in doc
 Key: OPENNLP-1148
 URL: https://issues.apache.org/jira/browse/OPENNLP-1148
 Project: OpenNLP
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.8.3


The doc does not use PlainTextByLineStream() consistently. Besides passing 
StandardCharsets.UTF_8 as its second parameter, the following variations 
appear:
- String "UTF-8"
- StandardCharsets.UTF8 (not UTF_8)
- Charset.forName("UTF-8")

Let's use StandardCharsets.UTF_8 consistently.
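
For comparison, a small sketch of two of the variants in plain Java. CharsetVariantsSketch is an illustrative name; both methods yield an equal Charset, but only the constant avoids a run-time name lookup. Note that java.nio.charset.StandardCharsets defines UTF_8 and has no UTF8 field.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetVariantsSketch {
  // Looks the charset up by name at run time; a misspelled name fails only then.
  static Charset byName() {
    return Charset.forName("UTF-8");
  }

  // Constant checked by the compiler; no lookup involved.
  static Charset byConstant() {
    return StandardCharsets.UTF_8;
  }
}
```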





[jira] [Created] (OPENNLP-1147) Missing URLs in doc

2017-10-20 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1147:
---

 Summary: Missing URLs in doc
 Key: OPENNLP-1147
 URL: https://issues.apache.org/jira/browse/OPENNLP-1147
 Project: OpenNLP
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.8.3


When I read the name finder part of the documentation, I found that some URLs 
were missing. I'd like to correct those for which I could find a current or 
alternative one.





[jira] [Resolved] (OPENNLP-1146) remove unnecessary serialVersionUID

2017-10-18 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1146.
-
Resolution: Fixed

Thanks all for reviewing this!

> remove unnecessary serialVersionUID
> ---
>
> Key: OPENNLP-1146
> URL: https://issues.apache.org/jira/browse/OPENNLP-1146
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> Several classes declare an unnecessary serialVersionUID constant. Most of 
> them are Stemmer classes generated by the Snowball-to-Java compiler. I think 
> we can just remove serialVersionUID from the Stemmer classes. Besides those, 
> Exception classes that extend RuntimeException or IOException also declare 
> serialVersionUID. I'll remove it from these Exception classes as well, but 
> add @SuppressWarnings("serial") just in case.





[jira] [Created] (OPENNLP-1146) remove unnecessary serialVersionUID

2017-10-17 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1146:
---

 Summary: remove unnecessary serialVersionUID
 Key: OPENNLP-1146
 URL: https://issues.apache.org/jira/browse/OPENNLP-1146
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.8.3


Several classes declare an unnecessary serialVersionUID constant. Most of them 
are Stemmer classes generated by the Snowball-to-Java compiler. I think we can 
just remove serialVersionUID from the Stemmer classes. Besides those, Exception 
classes that extend RuntimeException or IOException also declare 
serialVersionUID. I'll remove it from these Exception classes as well, but add 
@SuppressWarnings("serial") just in case.
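
As an illustration of the Exception case, here is a sketch with a hypothetical class name (InvalidFormatSketchException is not an OpenNLP class). The annotation silences the compiler's "serial" lint warning about a Serializable class without a serialVersionUID.

```java
import java.io.IOException;

// Hypothetical exception class used only for illustration.
// IOException is Serializable, so the "serial" lint would otherwise warn
// about the missing serialVersionUID field.
@SuppressWarnings("serial")
class InvalidFormatSketchException extends IOException {
  InvalidFormatSketchException(String message) {
    super(message);
  }
}
```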





[jira] [Resolved] (OPENNLP-1141) Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible

2017-10-05 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1141.
-
Resolution: Invalid

> Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible
> ---
>
> Key: OPENNLP-1141
> URL: https://issues.apache.org/jira/browse/OPENNLP-1141
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> BioCodec and BilouCodec implement areOutcomesCompatible(). I think they can 
> be written as DFA (Deterministic Finite Automaton).
> In this ticket, I'll add a simple implementation of DFA and change 
> areOutcomesCompatible() to use it.





[jira] [Commented] (OPENNLP-1141) Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible

2017-10-05 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192675#comment-16192675
 ] 

Koji Sekiguchi commented on OPENNLP-1141:
-

After a discussion with Joern, I learned that we should treat outcomes as a 
set, not a sequence. A DFA therefore cannot be applied to 
SequenceCodec.areOutcomesCompatible(), but it can be used in the sequence 
validators.

I'll withdraw this.
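
To illustrate the sequence-validator case, here is a minimal DFA sketch for a simplified BIO scheme. It assumes a single entity type and the bare labels "start", "cont" and "other"; BioDfaSketch and its state names are hypothetical and not taken from the OpenNLP code.

```java
import java.util.List;

class BioDfaSketch {
  enum State { OUTSIDE, INSIDE, REJECT }

  // Transition function: "cont" is only legal while inside a name.
  static State step(State state, String outcome) {
    if (state == State.REJECT) {
      return State.REJECT;
    }
    switch (outcome) {
      case "start":
        return State.INSIDE;
      case "cont":
        return state == State.INSIDE ? State.INSIDE : State.REJECT;
      case "other":
        return State.OUTSIDE;
      default:
        return State.REJECT; // unknown label
    }
  }

  // A sequence is valid if the automaton never reaches REJECT.
  static boolean isValid(List<String> outcomes) {
    State state = State.OUTSIDE;
    for (String outcome : outcomes) {
      state = step(state, outcome);
    }
    return state != State.REJECT;
  }
}
```

This works on an ordered sequence of outcomes, which is exactly why it does not fit areOutcomesCompatible(), where the outcomes form a set.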

> Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible
> ---
>
> Key: OPENNLP-1141
> URL: https://issues.apache.org/jira/browse/OPENNLP-1141
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> BioCodec and BilouCodec implement areOutcomesCompatible(). I think they can 
> be written as DFA (Deterministic Finite Automaton).
> In this ticket, I'll add a simple implementation of DFA and change 
> areOutcomesCompatible() to use it.





[jira] [Resolved] (OPENNLP-1138) Add more tests to Span

2017-10-03 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1138.
-
   Resolution: Fixed
Fix Version/s: 1.8.3

> Add more tests to Span
> --
>
> Key: OPENNLP-1138
> URL: https://issues.apache.org/jira/browse/OPENNLP-1138
> Project: OpenNLP
>  Issue Type: Test
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> Span's constructor can throw IllegalArgumentException, but there are no tests 
> for it. I'll add them. In addition, I'll fix the test for toString(), 
> because it doesn't actually test it :) , and I'll remove a redundancy from a 
> constructor.





[jira] [Resolved] (OPENNLP-1139) BilouCodec should use its own constants

2017-10-03 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1139.
-
   Resolution: Fixed
Fix Version/s: 1.8.3

> BilouCodec should use its own constants
> ---
>
> Key: OPENNLP-1139
> URL: https://issues.apache.org/jira/browse/OPENNLP-1139
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> It seems that BilouCodec accidentally uses BioCodec's constants such as 
> BioCodec.START, BioCodec.CONTINUE, etc. It should use its own ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (OPENNLP-1139) BilouCodec should use its own constants

2017-10-03 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1139:
---

 Summary: BilouCodec should use its own constants
 Key: OPENNLP-1139
 URL: https://issues.apache.org/jira/browse/OPENNLP-1139
 Project: OpenNLP
  Issue Type: Bug
  Components: Name Finder
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial


It seems that BilouCodec accidentally uses BioCodec's constants such as 
BioCodec.START, BioCodec.CONTINUE, etc. It should use its own ones.





[jira] [Created] (OPENNLP-1138) Add more tests to Span

2017-10-02 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1138:
---

 Summary: Add more tests to Span
 Key: OPENNLP-1138
 URL: https://issues.apache.org/jira/browse/OPENNLP-1138
 Project: OpenNLP
  Issue Type: Test
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial


Span's constructor can throw IllegalArgumentException, but there are no tests 
for it. I'll add tests for that and, in addition, I'll fix the test for 
toString() because it doesn't actually test it :) , and I'll remove a 
redundancy from a constructor.
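
As an illustration of the kind of checks meant here, below is a sketch against a hypothetical SimpleSpan class. SimpleSpan, its validation rules, its toString() format, and the SpanChecks helper are assumptions made for this example; they are not copied from OpenNLP's Span. The helper turns "the constructor throws" into a boolean so each invalid case can be asserted separately.

```java
// Hypothetical stand-in for a span class with a validating constructor.
final class SimpleSpan {
  private final int start;
  private final int end;

  SimpleSpan(int start, int end) {
    if (start < 0) {
      throw new IllegalArgumentException("start index must be zero or greater: " + start);
    }
    if (end < 0) {
      throw new IllegalArgumentException("end index must be zero or greater: " + end);
    }
    if (start > end) {
      throw new IllegalArgumentException(
          "start index must not be larger than end index: start=" + start + ", end=" + end);
    }
    this.start = start;
    this.end = end;
  }

  @Override
  public String toString() {
    // Assumed format: half-open interval, end exclusive.
    return "[" + start + ".." + end + ")";
  }
}

final class SpanChecks {
  private SpanChecks() {}

  // Returns true if constructing the span throws IllegalArgumentException.
  static boolean rejects(int start, int end) {
    try {
      new SimpleSpan(start, end);
      return false;
    } catch (IllegalArgumentException expected) {
      return true;
    }
  }
}
```

A toString() test that really tests something compares against the exact expected string, for example that new SimpleSpan(5, 11).toString() equals "[5..11)" under the format assumed above.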





[jira] [Resolved] (OPENNLP-1137) Add more tests and check overlapping of name spans to NameSample

2017-10-02 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1137.
-
   Resolution: Fixed
Fix Version/s: 1.8.3

> Add more tests and check overlapping of name spans to NameSample
> 
>
> Key: OPENNLP-1137
> URL: https://issues.apache.org/jira/browse/OPENNLP-1137
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.3
>
>
> NameSample has the following TODO in its constructor:
> {quote}// TODO: Check that name spans are not overlapping, otherwise throw 
> exception{quote}
> I added simple code for it and its test.
> And I added a test for nested name spans.





[jira] [Created] (OPENNLP-1137) Add more tests and check overlapping of name spans to NameSample

2017-10-02 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1137:
---

 Summary: Add more tests and check overlapping of name spans to 
NameSample
 Key: OPENNLP-1137
 URL: https://issues.apache.org/jira/browse/OPENNLP-1137
 Project: OpenNLP
  Issue Type: Improvement
  Components: Name Finder
Affects Versions: 1.8.2
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial


NameSample has the following TODO in its constructor:

{quote}// TODO: Check that name spans are not overlapping, otherwise throw 
exception{quote}

I added simple code for it and its test.

And I added a test for nested name spans.
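
One way such a check could be written, as a sketch only: spans are represented here as {start, end} int pairs with the end exclusive, and the class and method names are hypothetical. This is not the code that was committed. Note one assumption in particular: this sketch treats a nested span as overlapping and rejects it, which may or may not match what NameSample does.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper; each span is {start, end} with end exclusive.
final class NameSpanCheck {
  private NameSpanCheck() {}

  // Returns true if any two spans share at least one token position.
  static boolean hasOverlap(int[][] spans) {
    int[][] sorted = spans.clone(); // reorder a copy, leave the caller's array alone
    Arrays.sort(sorted, Comparator.comparingInt((int[] s) -> s[0]));
    // After sorting by start, any overlapping pair implies an overlapping
    // adjacent pair, so comparing neighbours is enough.
    for (int i = 1; i < sorted.length; i++) {
      if (sorted[i][0] < sorted[i - 1][1]) {
        return true;
      }
    }
    return false;
  }

  // Constructor-style guard: throws instead of returning a flag.
  static void requireNoOverlap(int[][] spans) {
    if (hasOverlap(spans)) {
      throw new IllegalArgumentException("name spans must not overlap");
    }
  }
}
```

Spans that merely touch, such as {0, 2} and {2, 4}, pass the check because the end index is exclusive.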





[jira] [Resolved] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework

2017-05-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1044.
-
   Resolution: Fixed
Fix Version/s: 1.8.0

> Add validate() which checks validity of parameters in the process of the 
> framework
> --
>
> Key: OPENNLP-1044
> URL: https://issues.apache.org/jira/browse/OPENNLP-1044
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.8.0
>
>
> When I worked on OPENNLP-1039, I saw that the client code throws 
> IllegalArgumentException when isValid() returns false, but I think this kind 
> of method should throw the exception itself, and the timing of its use 
> should be controlled by the framework.
> So it should look like:
> {code}
> public abstract class AbstractTrainer {
>   @Deprecated
>   public boolean isValid() { ... }
>   // if the subclass overrides this, it should call super.validate();
>   public void validate() throws IllegalArgumentException {
>     // default implementation here
>   }
>   // this is the controller of the flow of training...
>   public final void train() {
>     // initializing
>     init();
>     // validating parameters
>     validate();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework

2017-05-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi closed OPENNLP-1044.
---

> Add validate() which checks validity of parameters in the process of the 
> framework
> --
>
> Key: OPENNLP-1044
> URL: https://issues.apache.org/jira/browse/OPENNLP-1044
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.8.0
>
>
> When I worked on OPENNLP-1039, I saw that the client code throws 
> IllegalArgumentException when isValid() returns false, but I think this kind 
> of method should throw the exception itself, and the timing of its use 
> should be controlled by the framework.
> So it should look like:
> {code}
> public abstract class AbstractTrainer {
>   @Deprecated
>   public boolean isValid() { ... }
>   // if the subclass overrides this, it should call super.validate();
>   public void validate() throws IllegalArgumentException {
>     // default implementation here
>   }
>   // this is the controller of the flow of training...
>   public final void train() {
>     // initializing
>     init();
>     // validating parameters
>     validate();
>   }
> }
> {code}





[jira] [Assigned] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework

2017-05-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1044:
---

Assignee: Koji Sekiguchi

> Add validate() which checks validity of parameters in the process of the 
> framework
> --
>
> Key: OPENNLP-1044
> URL: https://issues.apache.org/jira/browse/OPENNLP-1044
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
>
> When I worked on OPENNLP-1039, I saw that the client code throws 
> IllegalArgumentException when isValid() returns false, but I think this kind 
> of method should throw the exception itself, and the timing of its use 
> should be controlled by the framework.
> So it should look like:
> {code}
> public abstract class AbstractTrainer {
>   @Deprecated
>   public boolean isValid() { ... }
>   // if the subclass overrides this, it should call super.validate();
>   public void validate() throws IllegalArgumentException {
>     // default implementation here
>   }
>   // this is the controller of the flow of training...
>   public final void train() {
>     // initializing
>     init();
>     // validating parameters
>     validate();
>   }
> }
> {code}





[jira] [Created] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework

2017-04-23 Thread Koji Sekiguchi (JIRA)
Koji Sekiguchi created OPENNLP-1044:
---

 Summary: Add validate() which checks validity of parameters in the 
process of the framework
 Key: OPENNLP-1044
 URL: https://issues.apache.org/jira/browse/OPENNLP-1044
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Koji Sekiguchi
Priority: Minor


When I worked on OPENNLP-1039, I saw that the client code throws 
IllegalArgumentException when isValid() returns false, but I think this kind of 
method should throw the exception itself, and the timing of its use should be 
controlled by the framework.

So it should look like:

{code}
public abstract class AbstractTrainer {
  @Deprecated
  public boolean isValid() { ... }

  // if the subclass overrides this, it should call super.validate();
  public void validate() throws IllegalArgumentException {
    // default implementation here
  }

  // this is the controller of the flow of training...
  public final void train() {
    // initializing
    init();

    // validating parameters
    validate();
  }
}
{code}
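
For readers who want something that compiles, here is a self-contained sketch of the same template-method idea. Everything in it is an assumption made for illustration: the class names SketchTrainer and SketchPerceptronTrainer, the parameter keys "Iterations", "Cutoff" and "Algorithm", their defaults, and the doTrain() hook are not taken from the OpenNLP code base. It only shows the flow proposed above: train() is final, calls init() and then validate(), and a subclass that overrides validate() calls super.validate() first.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class; parameter names and defaults are assumptions.
abstract class SketchTrainer {
  protected final Map<String, String> params = new HashMap<>();

  SketchTrainer(Map<String, String> params) {
    this.params.putAll(params);
  }

  // Hook for subclasses; nothing to initialise in the base class.
  protected void init() {}

  // Throws by itself instead of returning a flag the caller must check.
  // Integer.parseInt throws NumberFormatException, a subclass of
  // IllegalArgumentException, for malformed values.
  public void validate() {
    int iterations = Integer.parseInt(params.getOrDefault("Iterations", "100"));
    int cutoff = Integer.parseInt(params.getOrDefault("Cutoff", "5"));
    if (iterations <= 0) {
      throw new IllegalArgumentException("Iterations must be positive: " + iterations);
    }
    if (cutoff < 0) {
      throw new IllegalArgumentException("Cutoff must not be negative: " + cutoff);
    }
  }

  // The framework, not the client, decides when validation happens.
  public final String train() {
    init();
    validate();
    return doTrain();
  }

  protected abstract String doTrain();
}

final class SketchPerceptronTrainer extends SketchTrainer {
  SketchPerceptronTrainer(Map<String, String> params) {
    super(params);
  }

  @Override
  public void validate() {
    super.validate(); // keep the base checks for iterations and cutoff
    String algorithm = params.get("Algorithm");
    if (algorithm != null && !"PERCEPTRON".equals(algorithm)) {
      throw new IllegalArgumentException("unsupported algorithm: " + algorithm);
    }
  }

  @Override
  protected String doTrain() {
    return "trained"; // placeholder for the real training result
  }
}
```

Because train() is final, a client cannot skip validation, and a bad parameter surfaces as an exception at one well-defined point rather than wherever the client remembered to call isValid().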






[jira] [Closed] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()

2017-04-23 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi closed OPENNLP-1039.
---

> PerceptronTrainer should call super.isValid() in its isValid()
> --
>
> Key: OPENNLP-1039
> URL: https://issues.apache.org/jira/browse/OPENNLP-1039
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.0
>
>
> The current implementation of PerceptronTrainer#isValid() is:
> {code}
>   public boolean isValid() {
>     String algorithmName = getAlgorithm();
>     return !(algorithmName != null &&
>         !(PERCEPTRON_VALUE.equals(algorithmName)));
>   }
> {code}
> but it should call super.isValid() to check iterations and cutoff parameters 
> because PerceptronTrainer uses them.
> And if possible, I'd like to rewrite the last line (return statement) because 
> I needed a few minutes to understand it as it has three exclamation points in 
> one line. :)
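
A possible rewrite of that return statement, shown as a sketch on hypothetical classes. BaseTrainerSketch, PerceptronTrainerSketch, the constructor, the value of PERCEPTRON_VALUE, and the base-class checks on iterations and cutoff are all assumptions for this example, not the actual OpenNLP code. By De Morgan's law, !(a != null && !X.equals(a)) is the same condition as a == null || X.equals(a), which reads without any negation.

```java
// Hypothetical base class standing in for the trainer superclass.
class BaseTrainerSketch {
  protected final String algorithm;
  protected final int iterations;
  protected final int cutoff;

  BaseTrainerSketch(String algorithm, int iterations, int cutoff) {
    this.algorithm = algorithm;
    this.iterations = iterations;
    this.cutoff = cutoff;
  }

  // Assumed base checks: the parameters the issue says must also be verified.
  public boolean isValid() {
    return iterations > 0 && cutoff >= 0;
  }
}

final class PerceptronTrainerSketch extends BaseTrainerSketch {
  static final String PERCEPTRON_VALUE = "PERCEPTRON"; // assumed value

  PerceptronTrainerSketch(String algorithm, int iterations, int cutoff) {
    super(algorithm, iterations, cutoff);
  }

  // The condition as quoted in the issue, kept only for comparison.
  boolean originalAlgorithmCheck() {
    return !(algorithm != null && !(PERCEPTRON_VALUE.equals(algorithm)));
  }

  @Override
  public boolean isValid() {
    if (!super.isValid()) {
      return false; // iterations and cutoff are checked by the superclass
    }
    // Equivalent to the original condition, without the stacked negations.
    return algorithm == null || PERCEPTRON_VALUE.equals(algorithm);
  }
}
```

The two forms agree for every input: an unset algorithm and "PERCEPTRON" are accepted, and anything else is rejected.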





[jira] [Resolved] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()

2017-04-23 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1039.
-
   Resolution: Fixed
Fix Version/s: 1.8.0

> PerceptronTrainer should call super.isValid() in its isValid()
> --
>
> Key: OPENNLP-1039
> URL: https://issues.apache.org/jira/browse/OPENNLP-1039
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.0
>
>
> The current implementation of PerceptronTrainer#isValid() is:
> {code}
>   public boolean isValid() {
>     String algorithmName = getAlgorithm();
>     return !(algorithmName != null &&
>         !(PERCEPTRON_VALUE.equals(algorithmName)));
>   }
> {code}
> but it should call super.isValid() to check iterations and cutoff parameters 
> because PerceptronTrainer uses them.
> And if possible, I'd like to rewrite the last line (return statement) because 
> I needed a few minutes to understand it as it has three exclamation points in 
> one line. :)





[jira] [Assigned] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()

2017-04-23 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned OPENNLP-1039:
---

Assignee: Koji Sekiguchi

> PerceptronTrainer should call super.isValid() in its isValid()
> --
>
> Key: OPENNLP-1039
> URL: https://issues.apache.org/jira/browse/OPENNLP-1039
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
>
> The current implementation of PerceptronTrainer#isValid() is:
> {code}
>   public boolean isValid() {
>     String algorithmName = getAlgorithm();
>     return !(algorithmName != null &&
>         !(PERCEPTRON_VALUE.equals(algorithmName)));
>   }
> {code}
> but it should call super.isValid() to check iterations and cutoff parameters 
> because PerceptronTrainer uses them.
> And if possible, I'd like to rewrite the last line (return statement) because 
> I needed a few minutes to understand it as it has three exclamation points in 
> one line. :)




