[jira] [Commented] (OPENNLP-1261) Language Detector fails to predict language on long input texts

2019-06-12 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861784#comment-16861784
 ] 

Joern Kottmann commented on OPENNLP-1261:
-

One of the performance issues in feature generation is that the output 
consists of strings. I once worked on a prototype where the strings were hashed 
and the model then used these ints for the prediction. That turned out to be a 
bit faster (because it doesn't need to create all those String objects).
If performance is an issue, we should try to get that released.
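
To illustrate the hashing idea, here is a minimal sketch, not the prototype itself. The class and method names (NgramHasher, hashNgram, hashedNgrams) are made up for this example and are not OpenNLP API. It hashes character n-grams straight from the input, so no String is created per feature:

```java
// Sketch only: NgramHasher and its methods are hypothetical names, not OpenNLP API.
final class NgramHasher {

    // Hashes text[start, start + length) without allocating a substring.
    // Uses the same base-31 polynomial as String.hashCode().
    static int hashNgram(CharSequence text, int start, int length) {
        int h = 0;
        for (int i = start; i < start + length; i++) {
            h = 31 * h + text.charAt(i);
        }
        return h;
    }

    // Returns the hashed character n-grams of size n, in order of occurrence.
    static int[] hashedNgrams(CharSequence text, int n) {
        int count = Math.max(0, text.length() - n + 1);
        int[] features = new int[count];
        for (int i = 0; i < count; i++) {
            features[i] = hashNgram(text, i, n);
        }
        return features;
    }
}
```

A model would then have to be keyed by these ints instead of strings. Hash collisions between different n-grams are possible and are ignored in this sketch.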

I need to take a look at GISModel.eval(String[], float[]). If it can be used, 
we should run a test with it to see how much it helps.

The context generator could try to limit the number of ngrams by removing very 
rare ones, e.g. via cutoff or something similar.
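
A cutoff could look roughly like the following sketch. The names (NgramCutoff, countNgrams, applyCutoff) are hypothetical, and this is not how the OpenNLP context generator is implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: hypothetical helper, not part of OpenNLP.
final class NgramCutoff {

    // Counts every character n-gram of size n in the text.
    static Map<String, Integer> countNgrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Returns a copy that keeps only n-grams seen at least `cutoff` times.
    static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new HashMap<>(counts);
        kept.values().removeIf(count -> count < cutoff);
        return kept;
    }
}
```

For the text "abababc" and n = 2, the counts are ab=3, ba=2, bc=1, so a cutoff of 2 drops only "bc".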

Thanks for your help!


> Language Detector fails to predict language on long input texts
> ---
>
> Key: OPENNLP-1261
> URL: https://issues.apache.org/jira/browse/OPENNLP-1261
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Language Detector
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Major
> Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect 
> component fails to detect the language correctly, even though the text is 
> only written in one language.
> This issue was tracked down to the context generator, where the counts of the 
> ngrams are ignored.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (OPENNLP-1262) LanguageDetectorEvaluatorTest failure in windows

2019-05-25 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1262.
---
   Resolution: Fixed
 Assignee: Joern Kottmann
Fix Version/s: 1.9.2

> LanguageDetectorEvaluatorTest failure in windows
> 
>
> Key: OPENNLP-1262
> URL: https://issues.apache.org/jira/browse/OPENNLP-1262
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Language Detector
>Reporter: Madhawa Gunasekara
>Assignee: Joern Kottmann
>Priority: Major
> Fix For: 1.9.2
>
>
> Test failed with 
> Maven Version : 
> Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 
> 2018-10-24T20:41:47+02:00)
> Maven home: C:\Tools\apache-maven-3.6.0\bin\..
> Java version: 1.8.0_202, vendor: Oracle Corporation, runtime: C:\Program 
> Files\Java\jdk1.8.0_202\jre
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
> Results :
> Failed tests:
>  LanguageDetectorEvaluatorTest.processSample:75 expected:<...ed Predicted 
> Context[
> fra pob escreve e faz palestras pelo mundo inteiro sobre anjos
> fra pob escreve e faz palestras pelo mundo inteiro sobre anjos]
> > but was:<...ed Predicted Context[
> fra pob escreve e faz palestras pelo mundo inteiro sobre anjos
> ]ra pob escreve e faz palestras pelo mundo inteiro sobre anjos
> >
> Tests run: 767, Failures: 1, Errors: 0, Skipped: 0
> [INFO] 
> 
> [INFO] Reactor Summary for Apache OpenNLP Reactor 1.9.2-SNAPSHOT:
> [INFO]
> [INFO] Apache OpenNLP Reactor . SUCCESS [ 5.190 s]
> [INFO] Apache OpenNLP Tools ... FAILURE [01:40 
> min]
> [INFO] Apache OpenNLP UIMA Annotators . SKIPPED
> [INFO] Apache OpenNLP Brat Annotator .. SKIPPED
> [INFO] Apache OpenNLP Morfologik Addon  SKIPPED
> [INFO] Apache OpenNLP Documentation ... SKIPPED
> [INFO] Apache OpenNLP Distribution  SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:46 min
> [INFO] Finished at: 2019-05-23T16:35:43+02:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on 
> project opennlp-tools: There are test failures.
> [ERROR]
> [ERROR] Please refer to 
> C:\Developments\hobbies\opennlp\opennlp-tools\target\surefire-reports for the 
> individual test results.
> [ERROR] -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR] mvn  -rf :opennlp-tools
>  
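
The report does not state the cause, but the expected and actual strings above seem to differ only around the line ending. That would be consistent with a line separator mismatch ("\r\n" on Windows versus "\n" elsewhere). Assuming that is the problem, one way to make such a comparison platform independent is to normalize line endings first. LineEndings is a hypothetical helper and not necessarily the fix that was merged:

```java
// Sketch only: hypothetical helper, not the change merged for this issue.
final class LineEndings {

    // Converts Windows (\r\n) and lone carriage return (\r) line endings to \n.
    static String normalize(String s) {
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

A test would then compare LineEndings.normalize(expected) with LineEndings.normalize(actual).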





[jira] [Commented] (OPENNLP-1262) LanguageDetectorEvaluatorTest failure in windows

2019-05-25 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848104#comment-16848104
 ] 

Joern Kottmann commented on OPENNLP-1262:
-

Thanks for contributing the fix!

It is now merged.

> LanguageDetectorEvaluatorTest failure in windows
> 
>
> Key: OPENNLP-1262
> URL: https://issues.apache.org/jira/browse/OPENNLP-1262
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Language Detector
>Reporter: Madhawa Gunasekara
>Priority: Major
>





[jira] [Closed] (OPENNLP-1263) Build Warnings due to deprecated pom.version

2019-05-24 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1263.
---
   Resolution: Fixed
 Assignee: Joern Kottmann
Fix Version/s: 1.9.2

> Build Warnings due to deprecated pom.version 
> -
>
> Key: OPENNLP-1263
> URL: https://issues.apache.org/jira/browse/OPENNLP-1263
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Madhawa Gunasekara
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.9.2
>
>
> Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 
> 2018-10-24T20:41:47+02:00)
> Maven home: C:\Tools\apache-maven-3.6.0\bin\..
> Java version: 1.8.0_202, vendor: Oracle Corporation, runtime: C:\Program 
> Files\Java\jdk1.8.0_202\jre
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
> [INFO] Scanning for projects...
> [WARNING]
> [WARNING] Some problems were encountered while building the effective model 
> for org.apache.opennlp:opennlp-distr:pom:1.9.2-SNAPSHOT
> [WARNING] The expression ${pom.version} is deprecated. Please use 
> ${project.version} instead.
> [WARNING]
> [WARNING] It is highly recommended to fix these problems because they 
> threaten the stability of your build.
> [WARNING]
> [WARNING] For this reason, future Maven versions might no longer support 
> building such malformed projects.
> [WARNING]





[jira] [Commented] (OPENNLP-1263) Build Warnings due to deprecated pom.version

2019-05-24 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847465#comment-16847465
 ] 

Joern Kottmann commented on OPENNLP-1263:
-

Thanks for sending us the fix!

It is now merged.

> Build Warnings due to deprecated pom.version 
> -
>
> Key: OPENNLP-1263
> URL: https://issues.apache.org/jira/browse/OPENNLP-1263
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Madhawa Gunasekara
>Priority: Minor
>





[jira] [Assigned] (OPENNLP-1261) Lang Detect fails to predict language on long input texts

2019-05-22 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1261:
---

Assignee: Joern Kottmann

> Lang Detect fails to predict language on long input texts
> -
>
> Key: OPENNLP-1261
> URL: https://issues.apache.org/jira/browse/OPENNLP-1261
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Major
>
> If the input text is very long, e.g. 100k chars, then the lang detect 
> component fails to detect the language correctly, even though the text is 
> only written in one language.
> This issue was tracked down to the context generator, where the counts of the 
> ngrams are ignored.





[jira] [Updated] (OPENNLP-1261) Language Detector fails to predict language on long input texts

2019-05-22 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1261:

Component/s: Language Detector
Summary: Language Detector fails to predict language on long input 
texts  (was: Lang Detect fails to predict language on long input texts)

> Language Detector fails to predict language on long input texts
> ---
>
> Key: OPENNLP-1261
> URL: https://issues.apache.org/jira/browse/OPENNLP-1261
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Language Detector
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Major
>
> If the input text is very long, e.g. 100k chars, then the lang detect 
> component fails to detect the language correctly, even though the text is 
> only written in one language.
> This issue was tracked down to the context generator, where the counts of the 
> ngrams are ignored.





[jira] [Created] (OPENNLP-1261) Lang Detect fails to predict language on long input texts

2019-05-22 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1261:
---

 Summary: Lang Detect fails to predict language on long input texts
 Key: OPENNLP-1261
 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann


If the input text is very long, e.g. 100k chars, then the lang detect component 
fails to detect the language correctly, even though the text is only written in 
one language.

This issue was tracked down to the context generator, where the counts of the 
ngrams are ignored.





[jira] [Commented] (OPENNLP-1237) Document Categorizer example references non-existant method signature

2019-02-07 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763378#comment-16763378
 ] 

Joern Kottmann commented on OPENNLP-1237:
-

Thanks for opening this issue.

Do you want to send us a PR to fix it?

> Document Categorizer example references non-existant method signature
> -
>
> Key: OPENNLP-1237
> URL: https://issues.apache.org/jira/browse/OPENNLP-1237
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.9.1
>Reporter: Nick Burch
>Priority: Major
>
> In the Document Categorizer section of the manual 
> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat there 
> is a code snippet in the training section:
> {code}
>   model = DocumentCategorizerME.train("en", sampleStream);
> {code}
> However, no matching method is present in the javadocs at 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html
>  . The nearest seems to be one that takes two additional required parameters: 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html#train-java.lang.String-opennlp.tools.util.ObjectStream-opennlp.tools.util.TrainingParameters-opennlp.tools.doccat.DoccatFactory-
> It looks like the code snippet is out of date and needs updating to cover 
> the API changes that seem to have happened.





[jira] [Updated] (OPENNLP-1236) Add support for Arabic and Greek stemmers

2019-01-28 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1236:

Component/s: Stemmer

> Add support for Arabic and Greek stemmers
> -
>
> Key: OPENNLP-1236
> URL: https://issues.apache.org/jira/browse/OPENNLP-1236
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Stemmer
>Reporter: Maxime Steinmetz
>Priority: Minor
> Fix For: 1.9.2
>
>
> The Arabic and Greek Snowball stemmers are now available 
> (https://github.com/snowballstem/snowball/tree/master/algorithms) and it 
> would be nice to add support for those two.
>  
> This would require:
>  * Converting the .sbl files into Java code and adding it to the stemmer 
> folder
>  * Updating relevant classes to support the new stemmers
>  * Adding tests for the new stemmers





[jira] [Assigned] (OPENNLP-1236) Add support for Arabic and Greek stemmers

2019-01-28 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1236:
---

Assignee: Joern Kottmann

> Add support for Arabic and Greek stemmers
> -
>
> Key: OPENNLP-1236
> URL: https://issues.apache.org/jira/browse/OPENNLP-1236
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Stemmer
>Reporter: Maxime Steinmetz
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.9.2
>
>
> The Arabic and Greek Snowball stemmers are now available 
> (https://github.com/snowballstem/snowball/tree/master/algorithms) and it 
> would be nice to add support for those two.
>  
> This would require:
>  * Converting the .sbl files into Java code and adding it to the stemmer 
> folder
>  * Updating relevant classes to support the new stemmers
>  * Adding tests for the new stemmers





[jira] [Closed] (OPENNLP-1236) Add support for Arabic and Greek stemmers

2019-01-28 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1236.
---
Resolution: Fixed

Thanks for contributing this!

> Add support for Arabic and Greek stemmers
> -
>
> Key: OPENNLP-1236
> URL: https://issues.apache.org/jira/browse/OPENNLP-1236
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Stemmer
>Reporter: Maxime Steinmetz
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.9.2
>
>
> The Arabic and Greek Snowball stemmers are now available 
> (https://github.com/snowballstem/snowball/tree/master/algorithms) and it 
> would be nice to add support for those two.
>  
> This would require:
>  * Converting the .sbl files into Java code and adding it to the stemmer 
> folder
>  * Updating relevant classes to support the new stemmers
>  * Adding tests for the new stemmers





[jira] [Updated] (OPENNLP-1236) Add support for Arabic and Greek stemmers

2019-01-28 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1236:

Fix Version/s: (was: 1.9.1)
   1.9.2

> Add support for Arabic and Greek stemmers
> -
>
> Key: OPENNLP-1236
> URL: https://issues.apache.org/jira/browse/OPENNLP-1236
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Maxime Steinmetz
>Priority: Major
> Fix For: 1.9.2
>
>
> The Arabic and Greek Snowball stemmers are now available 
> (https://github.com/snowballstem/snowball/tree/master/algorithms) and it 
> would be nice to add support for those two.
>  
> This would require:
>  * Converting the .sbl files into Java code and adding it to the stemmer 
> folder
>  * Updating relevant classes to support the new stemmers
>  * Adding tests for the new stemmers





[jira] [Updated] (OPENNLP-1236) Add support for Arabic and Greek stemmers

2019-01-28 Thread Joern Kottmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1236:

Priority: Minor  (was: Major)

> Add support for Arabic and Greek stemmers
> -
>
> Key: OPENNLP-1236
> URL: https://issues.apache.org/jira/browse/OPENNLP-1236
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Maxime Steinmetz
>Priority: Minor
> Fix For: 1.9.2
>
>
> The Arabic and Greek Snowball stemmers are now available 
> (https://github.com/snowballstem/snowball/tree/master/algorithms) and it 
> would be nice to add support for those two.
>  
> This would require:
>  * Converting the .sbl files into Java code and adding it to the stemmer 
> folder
>  * Updating relevant classes to support the new stemmers
>  * Adding tests for the new stemmers





[jira] [Commented] (OPENNLP-1223) Add NameFinder model based on Tiger

2018-12-11 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716937#comment-16716937
 ] 

Joern Kottmann commented on OPENNLP-1223:
-

Let me contact them and I will then report on the outcome here.

> Add NameFinder model based on Tiger
> ---
>
> Key: OPENNLP-1223
> URL: https://issues.apache.org/jira/browse/OPENNLP-1223
> Project: OpenNLP
>  Issue Type: New Feature
>  Components: language model
>Reporter: J. Fiala
>Priority: Major
> Attachments: tiger_2.2_namefinder.bin.7z, 
> tiger_2.2_namefinder.testdata.txt, 
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - 
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>  h3. Tasks:
> 1.) add model based on tiger (/)
> >>> generated based on 6.271 sentences with tagged names (always given name + 
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
> 4.) check licensing issues (?)
> [http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/index.html]
> Contact information: "If you are interested in a commercial license of the 
> TIGERCorpus, please contact the secretary of Prof. Hans Uszkoreit's chair at 
> Saarland University at sek-hu AT coli DOT uni-saarland DOT de."
> * Clarify if a commercial license is needed for distributing a model derived 
> from the corpus.
> h3. Input data
>  * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
>  www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
>  * yagoLabels.tsv.7z (Max Planck Institute)
>  
> [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences in the tiger database with possible names (two words in 
> sequence tagged as NE)
> 2.) Check if possible names include a given name based on the YAGO labels 
> database (given name is assumed as first name)
> 3.) If given name is included in YAGO labels as givenName, then tag the 
> person name
> 4.) Train with full data set (50.472 sentences - including non-names)
> 5.) Evaluate with person data set (6.271 sentences)
>  >>> JF 14.10.: see updated model: 
> tiger_2.2_namefinder_all.bin_20181014.bin.7z
> h3. Open questions
> I first extracted 6.271 sentences mentioning names and trained based on that 
> (filtered) data. Or is it better to use the complete training data (including 
> the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 
> 7644.
>     TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
>    person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 
> 7659; tp: 7644; fp:  18]
>  
> h3. Further Improvements:
> 1.) There may be some names which are referring to locations which have to be 
> refined (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint 
>  Salvador Lopez Gonzalez , das Oberhaupt von 
>  San Juan   Juan Chamula  , einem 
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez 
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")





[jira] [Commented] (OPENNLP-1223) Add NameFinder model based on Tiger

2018-12-10 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714792#comment-16714792
 ] 

Joern Kottmann commented on OPENNLP-1223:
-

The Apache License allows commercial use. 

And the license of the Tiger corpus says "Use of the corpus or use of data 
derived from the corpus for any commercial purposes requires explicit written 
agreement of Licenser.".

I think we would need such a written agreement.

> Add NameFinder model based on Tiger
> ---
>
> Key: OPENNLP-1223
> URL: https://issues.apache.org/jira/browse/OPENNLP-1223
> Project: OpenNLP
>  Issue Type: New Feature
>  Components: language model
>Reporter: J. Fiala
>Priority: Major
> Attachments: tiger_2.2_namefinder.bin.7z, 
> tiger_2.2_namefinder.testdata.txt, 
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - 
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>  
> 1.) add model based on tiger (/)
> >>> generated based on 6.271 sentences with tagged names (always given name + 
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>  
> h3. Input data
>  * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
>  www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
>  * yagoLabels.tsv.7z (Max Planck Institute)
>  
> [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences in the tiger database with possible names (two words in 
> sequence tagged as NE)
> 2.) Check if possible names include a given name based on the YAGO labels 
> database (given name is assumed as first name)
> 3.) If given name is included in YAGO labels as givenName, then tag the 
> person name
> 4.) Train with full data set (50.472 sentences - including non-names)
> 5.) Evaluate with person data set (6.271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
> h3. Open questions
> I first extracted 6.271 sentences mentioning names and trained based on that 
> (filtered) data. Or is it better to use the complete training data (including 
> the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 
> 7644.
>     TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
>    person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 
> 7659; tp: 7644; fp:  18]
>  
> h3. Further Improvements:
> 1.) There may be some names which are referring to locations which have to be 
> refined (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint 
>  Salvador Lopez Gonzalez , das Oberhaupt von 
>  San Juan   Juan Chamula  , einem 
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez 
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")





[jira] [Commented] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner

2018-10-15 Thread Joern Kottmann (JIRA)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649846#comment-16649846
 ] 

Joern Kottmann commented on OPENNLP-1214:
-

I was +1 for this change because it looks a bit nicer than having the loop in 
the code here. Performance-wise, for a very small set size the hash lookup is 
actually slower than just performing a linear scan through the array.

I doubt this change has any impact on the run-time when it is measured 
while performing sentence splitting.
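
For reference, the two lookup strategies being compared look roughly like this. EosLookup is a hypothetical class for illustration and not the code of DefaultEndOfSentenceScanner:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch only: hypothetical helper, not DefaultEndOfSentenceScanner itself.
final class EosLookup {

    // Linear scan: cheap for a handful of characters, no boxing involved.
    static boolean containsLinear(char[] eosChars, char c) {
        for (char eos : eosChars) {
            if (eos == c) {
                return true;
            }
        }
        return false;
    }

    // Builds the set once; each later contains() call boxes the char and hashes it.
    static Set<Character> toSet(char[] eosChars) {
        Set<Character> set = new HashSet<>();
        for (char eos : eosChars) {
            set.add(eos);
        }
        return Collections.unmodifiableSet(set);
    }
}
```

With only a few end-of-sentence characters such as '.', '!' and '?', both variants give the same answer. Which one is faster would have to be measured.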

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check if each character in the sentence is one of the eos characters. I think 
> we'd better use a HashSet to keep eosCharacters instead of char[].
> In accordance with this replacement, I'd like to make 
> getEndOfSentenceCharacters() deprecated because it returns char[] and nobody 
> in OpenNLP calls it at present, and I'd like to add the equivalent method 
> which returns a Set of eos chars. It cannot keep the order of the 
> eos chars, but I don't think that will be a problem.





[jira] [Created] (OPENNLP-1220) Add support for Byte Pair Encoding (BPE)

2018-09-25 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1220:
---

 Summary: Add support for Byte Pair Encoding (BPE)
 Key: OPENNLP-1220
 URL: https://issues.apache.org/jira/browse/OPENNLP-1220
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann


It would be nice to add support for BPE to OpenNLP:

[https://arxiv.org/pdf/1508.07909.pdf]
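
As a rough illustration of what BPE training involves, the sketch below shows the two building blocks of a single merge step: counting adjacent symbol pairs and merging one pair. BpeSketch and its methods are hypothetical names, nothing here exists in OpenNLP, and the sketch assumes symbols never contain a space:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical names, not an OpenNLP API.
final class BpeSketch {

    // Counts adjacent symbol pairs over all words, weighted by word frequency.
    // Pairs are keyed as "left right" (assumes symbols contain no spaces).
    static Map<String, Integer> pairCounts(Map<List<String>, Integer> vocab) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<List<String>, Integer> entry : vocab.entrySet()) {
            List<String> symbols = entry.getKey();
            for (int i = 0; i + 1 < symbols.size(); i++) {
                String pair = symbols.get(i) + " " + symbols.get(i + 1);
                counts.merge(pair, entry.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    // Replaces every adjacent occurrence of (left, right) with the merged symbol.
    static List<String> merge(List<String> symbols, String left, String right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < symbols.size()) {
            boolean match = i + 1 < symbols.size()
                    && symbols.get(i).equals(left)
                    && symbols.get(i + 1).equals(right);
            if (match) {
                out.add(left + right);
                i += 2;
            } else {
                out.add(symbols.get(i));
                i++;
            }
        }
        return out;
    }
}
```

Training would repeat these two steps (pick the most frequent pair, merge it in every word) until the desired number of merge operations is reached.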

 





[jira] [Closed] (OPENNLP-1200) Unify code to sum up input context features

2018-05-22 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1200.
---
Resolution: Fixed

> Unify code to sum up input context features
> ---
>
> Key: OPENNLP-1200
> URL: https://issues.apache.org/jira/browse/OPENNLP-1200
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Trivial
> Fix For: 1.8.5
>
>
> The code to sum up input features in the ml package is duplicated and should 
> be unified in a util method.





[jira] [Closed] (OPENNLP-1193) Brat format support fails on multi fragment annotations

2018-05-22 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1193.
---
Resolution: Fixed

> Brat format support fails on multi fragment annotations
> ---
>
> Key: OPENNLP-1193
> URL: https://issues.apache.org/jira/browse/OPENNLP-1193
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Formats, Name Finder
>Affects Versions: 1.8.4
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Major
> Fix For: 1.8.5
>
>
> The brat format support assumes that the fragments of an annotation with 
> multiple fragments always appear next to each other. This assumption is false 
> (it is only true if there is a line break). If a single annotation is composed 
> of multiple fragments, they should be output as multiple name spans as well.





[jira] [Created] (OPENNLP-1200) Unify code to sum up input context features

2018-05-22 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1200:
---

 Summary: Unify code to sum up input context features
 Key: OPENNLP-1200
 URL: https://issues.apache.org/jira/browse/OPENNLP-1200
 Project: OpenNLP
  Issue Type: Improvement
  Components: Machine Learning
Reporter: Joern Kottmann
Assignee: Joern Kottmann
 Fix For: 1.8.5


The code to sum up input features in the ml package is duplicated and should 
be unified in a util method.





[jira] [Created] (OPENNLP-1193) Brat format support fails on multi fragment annotations

2018-04-12 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1193:
---

 Summary: Brat format support fails on multi fragment annotations
 Key: OPENNLP-1193
 URL: https://issues.apache.org/jira/browse/OPENNLP-1193
 Project: OpenNLP
  Issue Type: Bug
  Components: Formats, Name Finder
Affects Versions: 1.8.4
Reporter: Joern Kottmann
Assignee: Joern Kottmann
 Fix For: 1.8.5


The brat format support assumes that the fragments of a multi-fragment 
annotation always appear next to each other. This assumption is false (it is 
only true if there is a line break). If a single annotation is composed of 
multiple fragments, they should be output as multiple name spans as well.





[jira] [Created] (OPENNLP-1185) Tokenizers should be able to output a new line token

2018-02-09 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1185:
---

 Summary: Tokenizers should be able to output a new line token
 Key: OPENNLP-1185
 URL: https://issues.apache.org/jira/browse/OPENNLP-1185
 Project: OpenNLP
  Issue Type: Improvement
  Components: Tokenizer
Reporter: Joern Kottmann
Assignee: Peter Thygesen


Some use cases need the tokenizers to also output new line tokens. This is 
needed, e.g., by cTAKES to process clinical notes, or by the name finder to 
process lists of names where each name is written on its own line. It also 
helps the name finder to process news articles.

To fix this issue, add an option to all three tokenizers to emit new line tokens.
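
The requested option could behave roughly like the sketch below. This is only 
an illustration and not the OpenNLP tokenizer API: the class name, the 
emitNewlines flag and the "<NL>" placeholder token are all assumptions made 
for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not an OpenNLP class.
class NewlineTokenizerSketch {

    // Placeholder token emitted for each line break (an assumed name).
    static final String NEWLINE_TOKEN = "<NL>";

    // Splits text on whitespace. When emitNewlines is true, every '\n'
    // is reported as NEWLINE_TOKEN instead of being silently dropped.
    static List<String> tokenize(String text, boolean emitNewlines) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the token collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (c == '\n' && emitNewlines) {
                    tokens.add(NEWLINE_TOKEN);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the trailing token, if any.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With the flag off, the output matches a plain whitespace split, so existing 
callers would not see a change; a name finder reading a list of names could 
turn the flag on and use the extra token as a boundary hint.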





[jira] [Assigned] (OPENNLP-1158) The Brat Annotation Service does not serialize results appropriately

2017-12-05 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1158:
---

Assignee: Daniel Russ

> The Brat Annotation Service does not serialize results appropriately
> 
>
> Key: OPENNLP-1158
> URL: https://issues.apache.org/jira/browse/OPENNLP-1158
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Applications
>Affects Versions: 1.8.3
>Reporter: Daniel Russ
>Assignee: Daniel Russ
> Fix For: 1.8.4
>
>
> After starting up the BratAnnotatorService NameFinderAnnService, BRAT passes 
> text to the service, but it never returns.
>  curl -v   -H "Content-type: text/plain" -H "Accept: application/json" -X 
> POST -d "I am a fireman" localhost:8123/ner
> * About to connect() to localhost port 8123 (#0)
> *   Trying 127.0.0.1... connected
> * Connected to localhost (127.0.0.1) port 8123 (#0)
> > POST /ner HTTP/1.1
> > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 
> > zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> > Host: localhost:8123
> > Content-type: text/plain
> > Accept: application/json
> > Content-Length: 14
> > 
> < HTTP/1.1 400 Bad Request
> < Content-Type: text/plain
> < Date: Tue, 21 Nov 2017 19:43:15 GMT
> < Connection: close
> < Content-Length: 247
> < 
> * Closing connection #0
> No serializer found for class opennlp.bratann.NameFinderResource$NameAnn and 
> no properties discovered to create BeanSerializer (to avoid exception, 
> disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: 
> java.util.HashMap["0"]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (OPENNLP-1155) Remove deprecated leipzig doccat format support

2017-12-05 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1155.
---
Resolution: Fixed

> Remove deprecated leipzig doccat format support
> ---
>
> Key: OPENNLP-1155
> URL: https://issues.apache.org/jira/browse/OPENNLP-1155
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Doccat, Formats
>Reporter: Joern Kottmann
>Assignee: Peter Thygesen
>Priority: Minor
> Fix For: 1.8.4
>
>






[jira] [Updated] (OPENNLP-1165) Add filename to overlapping annotation exception in NameSample

2017-12-04 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1165:

Fix Version/s: 1.8.4

> Add filename to overlapping annotation exception in NameSample
> --
>
> Key: OPENNLP-1165
> URL: https://issues.apache.org/jira/browse/OPENNLP-1165
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.3
>Reporter: Peter Thygesen
>Assignee: Peter Thygesen
>Priority: Minor
> Fix For: 1.8.4
>
>
> When reading Brat annotated files I noticed that if I had overlapped 
> annotations an exception was thrown, but it did not contain any information 
> about in which file the annotation overlap was found.





[jira] [Assigned] (OPENNLP-1144) Add support for word vector resources

2017-11-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1144:
---

Assignee: Joern Kottmann

> Add support for word vector resources
> -
>
> Key: OPENNLP-1144
> URL: https://issues.apache.org/jira/browse/OPENNLP-1144
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
> Fix For: 1.8.3
>
>
> It would be nice to have support for word vector resources and parsing 
> support for the most common formats.





[jira] [Closed] (OPENNLP-1144) Add support for word vector resources

2017-11-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1144.
---
Resolution: Fixed

> Add support for word vector resources
> -
>
> Key: OPENNLP-1144
> URL: https://issues.apache.org/jira/browse/OPENNLP-1144
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
> Fix For: 1.8.3
>
>
> It would be nice to have support for word vector resources and parsing 
> support for the most common formats.





[jira] [Updated] (OPENNLP-1144) Add support for word vector resources

2017-11-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1144:

Fix Version/s: (was: 1.8.4)
   1.8.3

> Add support for word vector resources
> -
>
> Key: OPENNLP-1144
> URL: https://issues.apache.org/jira/browse/OPENNLP-1144
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
> Fix For: 1.8.3
>
>
> It would be nice to have support for word vector resources and parsing 
> support for the most common formats.





[jira] [Assigned] (OPENNLP-652) Add 20Newsgroups format support to the doccat component

2017-11-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-652:
--

 Assignee: Joern Kottmann
Fix Version/s: 1.8.4

> Add 20Newsgroups format support to the doccat component
> ---
>
> Key: OPENNLP-652
> URL: https://issues.apache.org/jira/browse/OPENNLP-652
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Doccat, Formats
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: help-wanted
> Fix For: 1.8.4
>
>
> It would be nice to have formats support for the 20Newsgroups data. The data 
> would be nice to have for a real demonstration of the doccat component.





[jira] [Closed] (OPENNLP-1157) Remove tokenizer param from doccat trainer cli

2017-11-20 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1157.
---
Resolution: Fixed

> Remove tokenizer param from doccat trainer cli
> --
>
> Key: OPENNLP-1157
> URL: https://issues.apache.org/jira/browse/OPENNLP-1157
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface, Doccat
>Affects Versions: 1.8.3
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.4
>
>
> The parameter is not used for training after the tokenization support was 
> removed from doccat.





[jira] [Created] (OPENNLP-1157) Remove tokenizer param from doccat trainer cli

2017-11-20 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1157:
---

 Summary: Remove tokenizer param from doccat trainer cli
 Key: OPENNLP-1157
 URL: https://issues.apache.org/jira/browse/OPENNLP-1157
 Project: OpenNLP
  Issue Type: Bug
  Components: Command Line Interface, Doccat
Affects Versions: 1.8.3
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.4


The parameter is not used for training after the tokenization support was 
removed from doccat.





[jira] [Created] (OPENNLP-1155) Remove deprecated leipzig doccat format support

2017-11-17 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1155:
---

 Summary: Remove deprecated leipzig doccat format support
 Key: OPENNLP-1155
 URL: https://issues.apache.org/jira/browse/OPENNLP-1155
 Project: OpenNLP
  Issue Type: Improvement
  Components: Doccat, Formats
Reporter: Joern Kottmann
Assignee: Peter Thygesen
Priority: Minor
 Fix For: 1.8.4








[jira] [Closed] (OPENNLP-1152) Move contents of RELEASE_NOTES.html into README.html

2017-10-24 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1152.
---
Resolution: Fixed

> Move contents of RELEASE_NOTES.html into README.html
> 
>
> Key: OPENNLP-1152
> URL: https://issues.apache.org/jira/browse/OPENNLP-1152
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.3
>
>
> The two files are almost duplicates, and the two sections about filing issues 
> and the links to the JIRA issues in the release could just as well be in 
> README.html.





[jira] [Created] (OPENNLP-1152) Move contents of RELEASE_NOTES.html into README.html

2017-10-24 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1152:
---

 Summary: Move contents of RELEASE_NOTES.html into README.html
 Key: OPENNLP-1152
 URL: https://issues.apache.org/jira/browse/OPENNLP-1152
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.3


The two files are almost duplicates, and the two sections about filing issues 
and the links to the JIRA issues in the release could just as well be in 
README.html.





[jira] [Created] (OPENNLP-1151) All Sample objects should implement Serializable for easy integration into other tools

2017-10-23 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1151:
---

 Summary: All Sample objects should implement Serializable for easy 
integration into other tools
 Key: OPENNLP-1151
 URL: https://issues.apache.org/jira/browse/OPENNLP-1151
 Project: OpenNLP
  Issue Type: Bug
Reporter: Joern Kottmann
Priority: Minor
 Fix For: 1.8.3


State-of-the-art frameworks like Apache Flink require objects to be 
serializable before they can be used in a pipeline. To use such a framework to 
prepare training data for OpenNLP, the Sample objects should all implement 
Serializable.
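
As a rough illustration of what this requirement means, the sketch below 
round-trips an object through Java serialization. TokenSampleLike is a 
hypothetical stand-in and not one of the OpenNLP Sample classes; only the 
java.io calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch; this is not an OpenNLP class.
class SerializableSampleSketch {

    // Stand-in for a Sample object that holds a few tokens.
    static class TokenSampleLike implements Serializable {
        private static final long serialVersionUID = 1L;

        final String[] tokens;

        TokenSampleLike(String... tokens) {
            this.tokens = tokens.clone();
        }
    }

    // Writes the sample to a byte array and reads it back, which is
    // roughly what a distributed framework does when it ships objects
    // between workers.
    static TokenSampleLike roundTrip(TokenSampleLike sample) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(sample);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TokenSampleLike) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the class did not implement Serializable, writeObject would fail with a 
java.io.NotSerializableException, which is the kind of error a pipeline 
would hit today.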





[jira] [Reopened] (OPENNLP-1133) The Gaussian smoother cannot be used

2017-10-10 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reopened OPENNLP-1133:
-
  Assignee: Daniel Russ

> The Gaussian smoother cannot be used
> 
>
> Key: OPENNLP-1133
> URL: https://issues.apache.org/jira/browse/OPENNLP-1133
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.2
>Reporter: Daniel Russ
>Assignee: Daniel Russ
> Fix For: 1.8.3
>
>
> In the  GISTrainer, the variable useGaussianSmoothing cannot be set using the 
> TrainingParameters





[jira] [Closed] (OPENNLP-1133) The Gaussian smoother cannot be used

2017-10-10 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1133.
---
Resolution: Fixed

> The Gaussian smoother cannot be used
> 
>
> Key: OPENNLP-1133
> URL: https://issues.apache.org/jira/browse/OPENNLP-1133
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: 1.8.2
>Reporter: Daniel Russ
>Assignee: Daniel Russ
> Fix For: 1.8.3
>
>
> In the  GISTrainer, the variable useGaussianSmoothing cannot be set using the 
> TrainingParameters





[jira] [Closed] (OPENNLP-1131) LeipzigLanguageSampleStreamFactory should not load hidden files

2017-10-10 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1131.
---
Resolution: Fixed

> LeipzigLanguageSampleStreamFactory should not load hidden files
> ---
>
> Key: OPENNLP-1131
> URL: https://issues.apache.org/jira/browse/OPENNLP-1131
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Language Detector
>Affects Versions: 1.8.2
>Reporter: Peter Thygesen
>Assignee: Peter Thygesen
> Fix For: 1.8.3
>
>
> A .DS_Store file is loaded as a sentence sample file. This should not happen.
> Exception in thread "main" java.io.UncheckedIOException: 
> java.nio.charset.MalformedInputException: Input length = 1
>   at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)
>   at java.util.Iterator.forEachRemaining(Iterator.java:115)
>   at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
>   at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
>   at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
>   at 
> opennlp.tools.formats.leipzig.LeipzigLanguageSampleStream$LeipzigSentencesStream.<init>(LeipzigLanguageSampleStream.java:57)
>   at 
> opennlp.tools.formats.leipzig.LeipzigLanguageSampleStream.read(LeipzigLanguageSampleStream.java:157)
>   at 
> opennlp.tools.formats.leipzig.LeipzigLanguageSampleStream.read(LeipzigLanguageSampleStream.java:42)
>   at 
> opennlp.tools.formats.leipzig.SampleShuffleStream.<init>(SampleShuffleStream.java:38)
>   at 
> opennlp.tools.formats.leipzig.LeipzigLanguageSampleStreamFactory.create(LeipzigLanguageSampleStreamFactory.java:76)
>   at 
> opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:106)
>   at opennlp.tools.cmdline.CLI.main(CLI.java:256)
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at java.io.BufferedReader$1.hasNext(BufferedReader.java:571)
>   ... 16 more
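
One way to avoid this kind of failure is to filter the directory listing 
before any file is opened. The sketch below is a minimal illustration under 
assumed names (VisibleFilesSketch, visibleFiles); it is not the code used in 
LeipzigLanguageSampleStreamFactory.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; this is not an OpenNLP class.
class VisibleFilesSketch {

    // Returns the regular, non-hidden files in dir, sorted by path.
    // The explicit dot-prefix check also covers platforms where
    // File.isHidden() does not treat names like ".DS_Store" as hidden.
    static List<File> visibleFiles(File dir) {
        FileFilter visible = f ->
            f.isFile() && !f.isHidden() && !f.getName().startsWith(".");
        File[] found = dir.listFiles(visible);
        if (found == null) {
            // dir is not a directory or could not be read.
            return new ArrayList<>();
        }
        Arrays.sort(found);
        return new ArrayList<>(Arrays.asList(found));
    }
}
```

Filtering by name as well as by File.isHidden() is a deliberate choice here, 
since a .DS_Store file copied to a non-macOS machine may not carry a hidden 
attribute.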





[jira] [Closed] (OPENNLP-1127) Fix readme HTML file generated for distribution archives

2017-10-10 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1127.
---
Resolution: Fixed

> Fix readme HTML file generated for distribution archives
> 
>
> Key: OPENNLP-1127
> URL: https://issues.apache.org/jira/browse/OPENNLP-1127
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.1
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Trivial
>  Labels: build, documentation
> Fix For: 1.8.2
>
>
> The current README.md file, in the project root directory, is used by the 
> opennlp-distr module. The readme file is included in distribution files. 
> There are a few changes in the master branch that were not released yet.
> Running `mvn clean install` will create the distribution files, and inside 
> you should find a README.html created based on the README.md file, plus other 
> files.
> The Markdown to HTML generation is being done through 
> https://github.com/walokra/markdown-page-generator-plugin.
> This issue is for enhancements in the README file and also around the 
> markdown-page-generator-plugin use. As 1.8.2 release is in progress, this may 
> be included in 1.8.3.
> Cheers
> Bruno





[jira] [Closed] (OPENNLP-1112) Jenkins should publish daily snapshot builds

2017-10-10 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1112.
---
   Resolution: Fixed
Fix Version/s: (was: 1.8.3)

> Jenkins should publish daily snapshot builds
> 
>
> Key: OPENNLP-1112
> URL: https://issues.apache.org/jira/browse/OPENNLP-1112
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>
> Jenkins should publish a snapshot build every time the master branch is 
> updated.





[jira] [Created] (OPENNLP-1144) Add support for word vector resources

2017-10-09 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1144:
---

 Summary: Add support for word vector resources
 Key: OPENNLP-1144
 URL: https://issues.apache.org/jira/browse/OPENNLP-1144
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
 Fix For: 1.8.3


It would be nice to have support for word vector resources and parsing support 
for the most common formats.





[jira] [Closed] (OPENNLP-1143) Set the Java 9 automatic module name

2017-10-09 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1143.
---
Resolution: Fixed

> Set the Java 9 automatic module name
> 
>
> Key: OPENNLP-1143
> URL: https://issues.apache.org/jira/browse/OPENNLP-1143
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.3
>
>
> The Java 9 module system derives the name from the jar file. This is not the 
> name we prefer and therefore it should be set in the manifest.





[jira] [Created] (OPENNLP-1143) Set the Java 9 automatic module name

2017-10-09 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1143:
---

 Summary: Set the Java 9 automatic module name
 Key: OPENNLP-1143
 URL: https://issues.apache.org/jira/browse/OPENNLP-1143
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.3


The Java 9 module system derives the name from the jar file. This is not the 
name we prefer and therefore it should be set in the manifest.





[jira] [Updated] (OPENNLP-1135) Remove support for OSGi

2017-09-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1135:

Fix Version/s: (was: 1.8.3)

> Remove support for OSGi
> ---
>
> Key: OPENNLP-1135
> URL: https://issues.apache.org/jira/browse/OPENNLP-1135
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Priority: Minor
>
> Remove the OSGi bundle support from the opennlp-tools jar. OSGi isn't used 
> widely and the ones who are using it know how to use opennlp-tools in an OSGi 
> environment anyway by applying some build tricks.





[jira] [Created] (OPENNLP-1135) Remove support for OSGi

2017-09-26 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1135:
---

 Summary: Remove support for OSGi
 Key: OPENNLP-1135
 URL: https://issues.apache.org/jira/browse/OPENNLP-1135
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Priority: Minor
 Fix For: 1.8.3


Remove the OSGi bundle support from the opennlp-tools jar. OSGi isn't used 
widely and the ones who are using it know how to use opennlp-tools in an OSGi 
environment anyway by applying some build tricks.





[jira] [Closed] (OPENNLP-1134) Remove dependencies on java.logging

2017-09-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1134.
---
Resolution: Fixed

> Remove dependencies on java.logging
> ---
>
> Key: OPENNLP-1134
> URL: https://issues.apache.org/jira/browse/OPENNLP-1134
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.3
>
>
> In two cases this is used instead of System.out.println. It would be better 
> to remove this and either ignore errors or fail with an exception.





[jira] [Created] (OPENNLP-1134) Remove dependencies on java.logging

2017-09-26 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1134:
---

 Summary: Remove dependencies on java.logging
 Key: OPENNLP-1134
 URL: https://issues.apache.org/jira/browse/OPENNLP-1134
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.3


In two cases this is used instead of System.out.println. It would be better to 
remove this and either ignore errors or fail with an exception.





[jira] [Closed] (OPENNLP-1126) Consolidate the two README files into README.md

2017-09-04 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1126.
---
Resolution: Fixed

> Consolidate the two README files into README.md
> ---
>
> Key: OPENNLP-1126
> URL: https://issues.apache.org/jira/browse/OPENNLP-1126
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.2
>
>
> It would be nicer to have just the README.md file and render it to html for 
> the binary distribution. The two files, README.md and README, currently 
> mostly duplicate their content.





[jira] [Created] (OPENNLP-1126) Consolidate the two README files into README.md

2017-09-04 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1126:
---

 Summary: Consolidate the two README files into README.md
 Key: OPENNLP-1126
 URL: https://issues.apache.org/jira/browse/OPENNLP-1126
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.2


It would be nicer to have just the README.md file and render it to html for the 
binary distribution. The two files, README.md and README, currently mostly 
duplicate their content.





[jira] [Closed] (OPENNLP-1122) Leipzig sample should allow skip initial entries

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1122.
---

> Leipzig sample should allow skip initial entries
> 
>
> Key: OPENNLP-1122
> URL: https://issues.apache.org/jira/browse/OPENNLP-1122
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Formats
>Affects Versions: 1.8.2
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.2
>
>






[jira] [Assigned] (OPENNLP-1112) Jenkins should publish daily snapshot builds

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1112:
---

Assignee: Joern Kottmann  (was: Suneel Marthi)

> Jenkins should publish daily snapshot builds
> 
>
> Key: OPENNLP-1112
> URL: https://issues.apache.org/jira/browse/OPENNLP-1112
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
> Fix For: 1.8.3
>
>
> Jenkins should publish a snapshot build every time the master branch is 
> updated.





[jira] [Updated] (OPENNLP-1113) Identify why some eval tests fail on AMD processors

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1113:

Fix Version/s: (was: 1.8.2)
   1.8.3

> Identify why some eval tests fail on AMD processors
> ---
>
> Key: OPENNLP-1113
> URL: https://issues.apache.org/jira/browse/OPENNLP-1113
> Project: OpenNLP
>  Issue Type: Test
>Affects Versions: 1.8.1
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Minor
> Fix For: 1.8.3
>
> Attachments: failure.txt, success.txt
>
>
> When running the eval-tests for the 1.8.1 tag some of the tests consistently 
> fail on an EC2 instance. On another virtual machine the tests consistently 
> pass. When the tests fail the failures are consistent with the following:
> {quote}Failed tests: 
>   
> ArvoresDeitadasEval.evalPortugueseChunkerQnMultipleThreads:208->chunkerCrossEval:128
>  expected:<0.9649180953528779> but was:<0.9650518197155942>
>   
> ArvoresDeitadasEval.evalPortugueseSentenceDetectorMaxentQn:143->sentenceCrossEval:90
>  expected:<0.99261110833375> but was:<0.9927505074644777>
>   Conll02NameFinderEval.evalSpanishOrganizationMaxentQn:390->eval:90 
> expected:<0.682961897915169> but was:<0.6798418972332015>
>   ConllXPosTaggerEval.evalSwedishMaxentQn:152->eval:76 
> expected:<0.9347595473833098> but was:<0.9322842998585573>{quote}
> Both systems are Ubuntu 16.04.2 running OpenJDK 1.8.0_131 but there must be 
> some other differences affecting the tests. Those differences need to be 
> identified.
> *VM1 (Tests Consistently _Pass_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-1022-aws", arch: "amd64", family: "unix"
> LANG=en_US.UTF-8
> *VM2 (Tests Consistently _Fail_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-83-generic", arch: "amd64", family: "unix"
> LANG=en_US.UTF-8
> This VM also consistently fails when using Oracle JDK:
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-oracle/jre
> *VM3 (Tests Consistently _Pass_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-83-generic", arch: "amd64", family: "unix"
> *VM4 (Tests Consistently _Fail_)*
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
> 2015-11-10T11:41:47-05:00)
> Maven home: C:\Program Files (x86)\maven\bin\..
> Java version: 1.8.0_92, vendor: Oracle Corporation
> Java home: C:\Program Files\Java\jdk1.8.0_92\jre
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"





[jira] [Updated] (OPENNLP-1112) Jenkins should publish daily snapshot builds

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1112:

Fix Version/s: (was: 1.8.2)
   1.8.3

> Jenkins should publish daily snapshot builds
> 
>
> Key: OPENNLP-1112
> URL: https://issues.apache.org/jira/browse/OPENNLP-1112
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Suneel Marthi
> Fix For: 1.8.3
>
>
> Jenkins should publish a snapshot build every time the master branch is 
> updated.





[jira] [Updated] (OPENNLP-936) Add thread safe versions of some tools (ME sentence detection, tokenization, pos tagging)

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-936:
---
Fix Version/s: (was: 1.8.2)
   1.8.3

> Add thread safe versions of some tools (ME sentence detection, tokenization, 
> pos tagging)
> -
>
> Key: OPENNLP-936
> URL: https://issues.apache.org/jira/browse/OPENNLP-936
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: POS Tagger
>Affects Versions: 1.7.1
>Reporter: Thilo Goetz
>Priority: Minor
> Fix For: 1.8.3
>
>
> As discussed on the mailing list, add thread safe versions of maximum entropy 
> sentence detection, tokenization and pos tagging.





[jira] [Closed] (OPENNLP-1124) Optimize XML Parser configuration

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1124.
---
Resolution: Fixed

> Optimize XML Parser configuration
> -
>
> Key: OPENNLP-1124
> URL: https://issues.apache.org/jira/browse/OPENNLP-1124
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.2
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.2
>
>






[jira] [Updated] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1082:

Fix Version/s: (was: 1.8.2)
   1.8.3

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.3
>
>






[jira] [Updated] (OPENNLP-976) Add formats support for germeval2014

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-976:
---
Fix Version/s: (was: 1.8.2)
   1.8.3

> Add formats support for germeval2014
> 
>
> Key: OPENNLP-976
> URL: https://issues.apache.org/jira/browse/OPENNLP-976
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Reporter: Joern Kottmann
>Assignee: Suneel Marthi
> Fix For: 1.8.3
>
>
> Details about the format can be found here:
> https://sites.google.com/site/germeval2014ner/data





[jira] [Updated] (OPENNLP-1106) Update the coref code to compile against 1.6.0

2017-08-31 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1106:

Fix Version/s: (was: 1.8.2)

> Update the coref code to compile against 1.6.0
> --
>
> Key: OPENNLP-1106
> URL: https://issues.apache.org/jira/browse/OPENNLP-1106
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Coref
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>
> It would be nice if the coref code compiled against an older release 
> version and were updated a bit so that it mostly complies with the 
> checkstyle rules.





[jira] [Commented] (OPENNLP-1017) OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException

2017-08-08 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118319#comment-16118319
 ] 

Joern Kottmann commented on OPENNLP-1017:
-

The code above breaks the evaluation, because you should not ignore the 
document boundaries, which are used to reset adaptive data in the name finder.

The problem is that your Name Samples don't have the clear adaptive data flag 
set.

> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException
> -
>
> Key: OPENNLP-1017
> URL: https://issues.apache.org/jira/browse/OPENNLP-1017
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Saurabh Jain
>
> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException.
> With an nfold value of 3, I tried to cross-validate NameFinder training data. 
> After some research I found that the first partition is assigned null data.





[jira] [Commented] (OPENNLP-1017) OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException

2017-08-08 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118307#comment-16118307
 ] 

Joern Kottmann commented on OPENNLP-1017:
-

Right, usually this exception is thrown if something is wrong with the training 
data.

Documents in the default format are separated by empty lines. If you train with 
a single document you will also have issues with adaptive feature generators; 
they just don't throw exceptions.
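
A quick way to check how many documents a training file contains is to count 
the blocks of non-empty lines. The sketch below is only an illustration of the 
"separated by empty lines" convention described above; the class and method 
names are hypothetical and it does not use any OpenNLP API. If the count is 
smaller than the number of folds, an empty partition would be a plausible 
explanation for the exception.

```java
import java.util.Arrays;
import java.util.List;

public class DocumentCounter {

    /** Counts blocks of non-empty lines; one or more empty lines separate documents. */
    public static int countDocuments(List<String> lines) {
        int documents = 0;
        boolean insideDocument = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                insideDocument = false;   // an empty line closes the current document
            } else if (!insideDocument) {
                insideDocument = true;    // first sentence of a new document
                documents++;
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        // Hypothetical training data: two documents separated by one empty line.
        List<String> lines = Arrays.asList(
            "first sentence of document A",
            "second sentence of document A",
            "",
            "first sentence of document B");
        System.out.println(countDocuments(lines)); // prints 2
    }
}
```

A file without any empty line counts as a single document, which is the case 
described above.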

> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException
> -
>
> Key: OPENNLP-1017
> URL: https://issues.apache.org/jira/browse/OPENNLP-1017
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Saurabh Jain
>
> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException.
> With an nfold value of 3, I tried to cross-validate NameFinder training data. 
> After some research I found that the first partition is assigned null data.





[jira] [Commented] (OPENNLP-1017) OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException

2017-08-08 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118281#comment-16118281
 ] 

Joern Kottmann commented on OPENNLP-1017:
-

[~neilireson] how many documents do you have in your training data?

> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException
> -
>
> Key: OPENNLP-1017
> URL: https://issues.apache.org/jira/browse/OPENNLP-1017
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Saurabh Jain
>
> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException.
> With an nfold value of 3, I tried to cross-validate NameFinder training data. 
> After some research I found that the first partition is assigned null data.





[jira] [Created] (OPENNLP-1119) Leipzig sample stream should shuffle data

2017-07-28 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1119:
---

 Summary: Leipzig sample stream should shuffle data 
 Key: OPENNLP-1119
 URL: https://issues.apache.org/jira/browse/OPENNLP-1119
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Reporter: Joern Kottmann
Assignee: Joern Kottmann
 Fix For: 1.8.2


The Leipzig language data files are sorted by the first token of a sentence, and 
the output is also sorted by language.

To improve this, the following should be done:
- The samples should be built from randomly picked lines taken from a sentences 
file
- The samples in the stream should be shuffled
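
The two steps above can be sketched as follows. This is a minimal illustration 
in plain Java, not the actual sample stream implementation; the class name, the 
method names and the "language<TAB>text" representation of a sample are 
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class LeipzigShuffleSketch {

    /** Step 1: build samples from randomly picked lines of one sentences file. */
    public static List<String> buildSamples(List<String> sentences,
                                            int sentencesPerSample, Random random) {
        List<String> shuffled = new ArrayList<>(sentences);
        Collections.shuffle(shuffled, random);   // removes the sort by first token
        List<String> samples = new ArrayList<>();
        // Leftover lines that do not fill a complete sample are skipped.
        for (int i = 0; i + sentencesPerSample <= shuffled.size(); i += sentencesPerSample) {
            samples.add(String.join(" ", shuffled.subList(i, i + sentencesPerSample)));
        }
        return samples;
    }

    /** Step 2: merge the samples of all languages and shuffle the result. */
    public static List<String> mixLanguages(Map<String, List<String>> samplesByLanguage,
                                            Random random) {
        List<String> mixed = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : samplesByLanguage.entrySet()) {
            for (String sample : entry.getValue()) {
                mixed.add(entry.getKey() + "\t" + sample);
            }
        }
        Collections.shuffle(mixed, random);      // removes the sort by language
        return mixed;
    }
}
```

Passing a seeded Random keeps the order reproducible between runs, which may 
matter for evaluation tests that compare against fixed expected scores.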





[jira] [Closed] (OPENNLP-1115) Document Categorizer all events dropped

2017-07-11 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1115.
---
Resolution: Won't Fix

Please ask questions on the user list.

> Document Categorizer all events dropped
> ---
>
> Key: OPENNLP-1115
> URL: https://issues.apache.org/jira/browse/OPENNLP-1115
> Project: OpenNLP
>  Issue Type: Question
>  Components: Doccat
>Affects Versions: 1.7.2
>Reporter: Alessandro Depase
> Attachments: Train1.train
>
>
> Hi all,
> I'm trying to perform my first (newbie) document categorization using the 
> Italian language.
> I'm using the attached train file and I got this output:
> {{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data 
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train"
>  -encoding UTF-8
> Indexing events using cutoff of 5
> Computing event counts...  done. 12 events
> Indexing...  Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, 
> bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to train 
> a model.
> To resolve this error use more training data, if this doesn't help there might
> be some fundamental problem with the training data itself.}}
> I already found a couple of other similar issues, just saying that there are 
> not enough lines (but I have 6 lines for each category and a cutoff of 5) or 
> that without at least 100 lines the categorization quality is not sufficient 
> (ok, but that's just a quality matter, it should work, with bad results, but 
> it should work). The reason for insufficient data is that all the lines are 
> dropped.
> I also tried with the Java API, with the same result.
> But why? What did I miss? I cannot find useful documentation...
> Thank you in advance
> Kind Regards
> Alessandro





[jira] [Commented] (OPENNLP-1113) Identify why some eval tests fail on AMD processors

2017-07-11 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081948#comment-16081948
 ] 

Joern Kottmann commented on OPENNLP-1113:
-

Let's try to see in which part this goes wrong. I suggest we write out all the 
generated events to disk and see if there is a difference in event generation.
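
A minimal sketch of that comparison, assuming each event can be rendered as one 
line of text (for example its outcome followed by its context strings). The 
class and method names are hypothetical and nothing here is part of the OpenNLP 
API; it only shows dumping events to a file and locating the first event that 
differs between two machines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class EventDumpDiff {

    /** Writes one event per line so that two runs can be compared textually. */
    public static void writeEvents(Path file, List<String> events) {
        try {
            Files.write(file, events, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a dump written by writeEvents. */
    public static List<String> readEvents(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the index of the first differing event, or -1 if the dumps are identical. */
    public static int firstDifference(List<String> a, List<String> b) {
        int common = Math.min(a.size(), b.size());
        for (int i = 0; i < common; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        // Same prefix: either identical, or one dump has extra events at the end.
        return a.size() == b.size() ? -1 : common;
    }
}
```

If the dumps from a passing and a failing machine are identical, the difference 
would have to come from a later stage, such as training itself.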

> Identify why some eval tests fail on AMD processors
> ---
>
> Key: OPENNLP-1113
> URL: https://issues.apache.org/jira/browse/OPENNLP-1113
> Project: OpenNLP
>  Issue Type: Test
>Affects Versions: 1.8.1
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Minor
> Fix For: 1.8.2
>
> Attachments: failure.txt, success.txt
>
>
> When running the eval-tests for the 1.8.1 tag some of the tests consistently 
> fail on an EC2 instance. On another virtual machine the tests consistently 
> pass. When the tests fail the failures are consistent with the following:
> {quote}Failed tests: 
>   ArvoresDeitadasEval.evalPortugueseChunkerQnMultipleThreads:208->chunkerCrossEval:128 expected:<0.9649180953528779> but was:<0.9650518197155942>
>   ArvoresDeitadasEval.evalPortugueseSentenceDetectorMaxentQn:143->sentenceCrossEval:90 expected:<0.99261110833375> but was:<0.9927505074644777>
>   Conll02NameFinderEval.evalSpanishOrganizationMaxentQn:390->eval:90 expected:<0.682961897915169> but was:<0.6798418972332015>
>   ConllXPosTaggerEval.evalSwedishMaxentQn:152->eval:76 expected:<0.9347595473833098> but was:<0.9322842998585573>{quote}
> Both systems are Ubuntu 16.04.2 running OpenJDK 1.8.0_131 but there must be 
> some other differences affecting the tests. Those differences need to be 
> identified.
> *VM1 (Tests Consistently _Pass_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-1022-aws", arch: "amd64", family: "unix"
> LANG=en_US.UTF-8
> *VM2 (Tests Consistently _Fail_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-83-generic", arch: "amd64", family: "unix"
> LANG=en_US.UTF-8
> This VM also consistently fails when using Oracle JDK:
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-oracle/jre
> *VM3 (Tests Consistently _Pass_)*
> Apache Maven 3.3.9
> Maven home: /usr/share/maven
> Java version: 1.8.0_131, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "4.4.0-83-generic", arch: "amd64", family: "unix"
> *VM4 (Tests Consistently _Fail_)*
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
> 2015-11-10T11:41:47-05:00)
> Maven home: C:\Program Files (x86)\maven\bin\..
> Java version: 1.8.0_92, vendor: Oracle Corporation
> Java home: C:\Program Files\Java\jdk1.8.0_92\jre
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"





[jira] [Closed] (OPENNLP-1013) [OpenNLP][R Language][1.5.3-2] Bug when using French models

2017-07-11 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1013.
---
   Resolution: Won't Fix
Fix Version/s: (was: 1.8.2)

> [OpenNLP][R Language][1.5.3-2] Bug when using French models
> ---
>
> Key: OPENNLP-1013
> URL: https://issues.apache.org/jira/browse/OPENNLP-1013
> Project: OpenNLP
>  Issue Type: Bug
>  Components: POS Tagger
>Affects Versions: tools-1.5.3
> Environment: R Language, RStudio
>Reporter: Iuri Deolindo Nogueira
>
> When using French models in the R language, I'm receiving a "subscript out of 
> bounds" error. Details:
> -
> I'm using French models for NLP in the R environment. To get the French 
> models, I'm using binaries compiled and developed by Nicolas:
> https://sites.google.com/site/nicolashernandez/resources/opennlp
> http://enicolashernandez.blogspot.fr/2012/12/apache-opennlp-fr-models.html
> https://drive.google.com/drive/folders/0B4AyWQriFkxgWHR6QzlvcmxmdE0
> -
> The problem happens only with the POS function. This is how I call the 
> function, and the resulting error:
> Maxent_POS_Tag_Annotator(language = "fr", probs = TRUE, model = 
> paste0(, "fr-pos.bin"))
> Issue: 
> Error in environment(f)$meta[[tag]] : subscript out of bounds
> -
> However, if I delete the language parameter, the issue no longer occurs:
> Maxent_POS_Tag_Annotator(probs = TRUE, model = 
> paste0(, "fr-pos.bin"))



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OPENNLP-1013) [OpenNLP][R Language][1.5.3-2] Bug when using French models

2017-07-11 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081931#comment-16081931
 ] 

Joern Kottmann commented on OPENNLP-1013:
-

It doesn't seem likely that the POS Tagger fails on model loading; the unit 
tests we have do that a lot.
The openNLP R stuff, even though it is called OpenNLP R, is not provided by this 
project.

> [OpenNLP][R Language][1.5.3-2] Bug when using French models
> ---
>
> Key: OPENNLP-1013
> URL: https://issues.apache.org/jira/browse/OPENNLP-1013
> Project: OpenNLP
>  Issue Type: Bug
>  Components: POS Tagger
>Affects Versions: tools-1.5.3
> Environment: R Language, RStudio
>Reporter: Iuri Deolindo Nogueira
> Fix For: 1.8.2
>
>
> When using French models in the R language, I'm receiving a "subscript out of 
> bounds" error. Details:
> -
> I'm using French models for NLP in the R environment. To get the French 
> models, I'm using binaries compiled and developed by Nicolas:
> https://sites.google.com/site/nicolashernandez/resources/opennlp
> http://enicolashernandez.blogspot.fr/2012/12/apache-opennlp-fr-models.html
> https://drive.google.com/drive/folders/0B4AyWQriFkxgWHR6QzlvcmxmdE0
> -
> The problem happens only with the POS function. This is how I call the 
> function, and the resulting error:
> Maxent_POS_Tag_Annotator(language = "fr", probs = TRUE, model = 
> paste0(, "fr-pos.bin"))
> Issue: 
> Error in environment(f)$meta[[tag]] : subscript out of bounds
> -
> However, if I delete the language parameter, the issue no longer occurs:
> Maxent_POS_Tag_Annotator(probs = TRUE, model = 
> paste0(, "fr-pos.bin"))





[jira] [Commented] (OPENNLP-1115) Document Categorizer all events dropped

2017-07-11 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081914#comment-16081914
 ] 

Joern Kottmann commented on OPENNLP-1115:
-

You are not training with enough data. Try to use much more data for training, 
e.g. 1000 lines (and they should not be copy-and-paste duplicates).
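
To illustrate why every event in the quoted log is dropped, here is a small 
stand-alone sketch of a feature cutoff. It assumes that a feature is kept only 
if it occurs at least "cutoff" times in the whole training set, and that an 
event with no surviving feature is dropped; the exact comparison used by the 
indexer is an assumption here, and the class name is hypothetical. The events 
are the twelve from the reporter's log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CutoffSketch {

    /** Returns the events that still have at least one feature after the cutoff. */
    public static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        // Count how often each feature occurs over all events.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        List<List<String>> kept = new ArrayList<>();
        for (List<String> event : events) {
            List<String> surviving = new ArrayList<>();
            for (String feature : event) {
                if (counts.get(feature) >= cutoff) {
                    surviving.add(feature);
                }
            }
            if (!surviving.isEmpty()) {
                kept.add(surviving);   // events without surviving features are dropped
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = Arrays.asList(
            Arrays.asList("ok"),
            Arrays.asList("tutto", "bene"),
            Arrays.asList("decisamente", "non", "male"),
            Arrays.asList("fantastica", "scelta"),
            Arrays.asList("non", "pensavo", "di", "poter", "essere", "così", "contento"),
            Arrays.asList("certamente", "un'ottimo", "risultato"),
            Arrays.asList("non", "va", "affatto", "bene"),
            Arrays.asList("per", "nulla"),
            Arrays.asList("niente", "affatto", "divertente"),
            Arrays.asList("va", "malissimo"),
            Arrays.asList("va", "decisamente", "male"),
            Arrays.asList("sono", "molto", "triste"));
        // The most frequent words ("non", "va") occur only three times each.
        System.out.println(applyCutoff(events, 5).size()); // prints 0
        System.out.println(applyCutoff(events, 1).size()); // prints 12
    }
}
```

Six lines per category is therefore not the relevant number: with so little 
text, no single word reaches the cutoff, so more training data is needed.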

> Document Categorizer all events dropped
> ---
>
> Key: OPENNLP-1115
> URL: https://issues.apache.org/jira/browse/OPENNLP-1115
> Project: OpenNLP
>  Issue Type: Question
>  Components: Doccat
>Affects Versions: 1.7.2
>Reporter: Alessandro Depase
> Attachments: Train1.train
>
>
> Hi all,
> I'm trying to perform my first (newbie) document categorization using the 
> Italian language.
> I'm using the attached train file and I got this output:
> {{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data 
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train"
>  -encoding UTF-8
> Indexing events using cutoff of 5
> Computing event counts...  done. 12 events
> Indexing...  Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, 
> bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to train 
> a model.
> To resolve this error use more training data, if this doesn't help there might
> be some fundamental problem with the training data itself.}}
> I already found a couple of other similar issues, just saying that there are 
> not enough lines (but I have 6 lines for each category and a cutoff of 5) or 
> that without at least 100 lines the categorization quality is not sufficient 
> (ok, but that's just a quality matter, it should work, with bad results, but 
> it should work). The reason for insufficient data is that all the lines are 
> dropped.
> I also tried with the Java API, with the same result.
> But why? What did I miss? I cannot find useful documentation...
> Thank you in advance
> Kind Regards
> Alessandro





[jira] [Created] (OPENNLP-1112) Travis should publish snapshot builds

2017-07-06 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1112:
---

 Summary: Travis should publish snapshot builds
 Key: OPENNLP-1112
 URL: https://issues.apache.org/jira/browse/OPENNLP-1112
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
 Fix For: 1.8.2


Travis should publish a snapshot build every time the master branch is updated.





[jira] [Created] (OPENNLP-1109) Change repositories in pom.xml to use GitHub repo

2017-07-03 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1109:
---

 Summary: Change repositories in pom.xml to use GitHub repo
 Key: OPENNLP-1109
 URL: https://issues.apache.org/jira/browse/OPENNLP-1109
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
 Fix For: 1.8.1


We have now migrated to gitbox and have our main repository at GitHub. The 
pom.xml needs to be updated to reflect that.





[jira] [Updated] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-07-03 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1082:

Fix Version/s: (was: 1.8.1)

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
>






[jira] [Commented] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-07-03 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072532#comment-16072532
 ] 

Joern Kottmann commented on OPENNLP-1082:
-

This was reverted and will be done after 1.8.1.

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
>






[jira] [Closed] (OPENNLP-1108) Default eos char addition causes backward compatbility problems

2017-07-03 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1108.
---
Resolution: Won't Fix

> Default eos char addition causes backward compatbility problems
> ---
>
> Key: OPENNLP-1108
> URL: https://issues.apache.org/jira/browse/OPENNLP-1108
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Sentence Detector
>Reporter: Joern Kottmann
>
> The addition of the \n as a default eos doesn't work well with existing 
> models that don't contain a list of eos chars, and also doesn't work for 
> users who train with defaults, because they now get different results.





[jira] [Reopened] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-07-03 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reopened OPENNLP-1082:
-

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.1
>
>






[jira] [Updated] (OPENNLP-1108) Default eos char addition causes backward compatbility problems

2017-07-03 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1108:

Fix Version/s: (was: 1.8.1)

> Default eos char addition causes backward compatbility problems
> ---
>
> Key: OPENNLP-1108
> URL: https://issues.apache.org/jira/browse/OPENNLP-1108
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Sentence Detector
>Reporter: Joern Kottmann
>
> The addition of the \n as a default eos doesn't work well with existing 
> models that don't contain a list of eos chars, and also doesn't work for 
> users who train with defaults, because they now get different results.





[jira] [Updated] (OPENNLP-1104) Fix images at the bottom of the Powered By page, and use lower cases for link

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1104:

Fix Version/s: (was: 1.8.1)

> Fix images at the bottom of the Powered By page, and use lower cases for link
> -
>
> Key: OPENNLP-1104
> URL: https://issues.apache.org/jira/browse/OPENNLP-1104
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Website
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: website
> Attachments: after.png, before.png
>
>
> The current Powered By page is the only page with upper case letters in its 
> link. Besides keeping things concise, there are cases where using lower case 
> URLs may be helpful for SEO (though that's not so relevant for our project, I 
> think).
> The images at the bottom are also not being displayed. I didn't know, but it 
> looks like in AsciiDoc you *must* include the []'s, even if empty.





[jira] [Closed] (OPENNLP-1104) Fix images at the bottom of the Powered By page, and use lower cases for link

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1104.
---
Resolution: Fixed

> Fix images at the bottom of the Powered By page, and use lower cases for link
> -
>
> Key: OPENNLP-1104
> URL: https://issues.apache.org/jira/browse/OPENNLP-1104
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Website
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: website
> Attachments: after.png, before.png
>
>
> The current Powered By page is the only page with upper case letters in its 
> link. Besides keeping things concise, there are cases where using lower case 
> URLs may be helpful for SEO (though that's not so relevant for our project, I 
> think).
> The images at the bottom are also not being displayed. I didn't know, but it 
> looks like in AsciiDoc you *must* include the []'s, even if empty.





[jira] [Closed] (OPENNLP-1103) Add AirNZ use case for OpenNLP to the web site

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1103.
---
Resolution: Fixed

> Add AirNZ use case for OpenNLP to the web site
> --
>
> Key: OPENNLP-1103
> URL: https://issues.apache.org/jira/browse/OPENNLP-1103
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Website
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: website
>
> Went to the Wynyard Quarter innovation week some weeks ago, and saw that 
> AirNZ was showing their bot and that it used OpenNLP. Spoke with Joey Faust, 
> Product Manager, and got the following testimonial for our site.
> {noformat}
> Air New Zealand uses OpenNLP to power its chatbot, Oscar. Launched in 
> February 2017, Oscar provides a conversational interface for customers to ask 
> questions about flights, amenities and policies. Using OpenNLP, we've been 
> able to consistently provide over 50% conversational success and support 
> hundreds of intents.
> {noformat}





[jira] [Updated] (OPENNLP-1103) Add AirNZ use case for OpenNLP to the web site

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1103:

Fix Version/s: (was: 1.8.1)

> Add AirNZ use case for OpenNLP to the web site
> --
>
> Key: OPENNLP-1103
> URL: https://issues.apache.org/jira/browse/OPENNLP-1103
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Website
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: website
>
> Went to the Wynyard Quarter innovation week some weeks ago, and saw that 
> AirNZ was showing their bot and that it used OpenNLP. Spoke with Joey Faust, 
> Product Manager, and got the following testimonial for our site.
> {noformat}
> Air New Zealand uses OpenNLP to power its chatbot, Oscar. Launched in 
> February 2017, Oscar provides a conversational interface for customers to ask 
> questions about flights, amenities and policies. Using OpenNLP, we've been 
> able to consistently provide over 50% conversational success and support 
> hundreds of intents.
> {noformat}





[jira] [Closed] (OPENNLP-1102) Universal Dependencies eval tests is failing

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1102.
---
Resolution: Fixed

> Universal Dependencies eval tests is failing
> 
>
> Key: OPENNLP-1102
> URL: https://issues.apache.org/jira/browse/OPENNLP-1102
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
> Fix For: 1.8.1
>
>
> Results :
> Failed tests: 
>   UniversalDependency20Eval.trainAndEvalSpanishAncora:82 expected:<0.9046675934566091> but was:<0.9057341692068787>
> Tests in error: 
>   Conll00ChunkerEval.evalEnglishMaxentQn:95->train:55 » OutOfMemory Java heap sp...
> Tests run: 774, Failures: 1, Errors: 1, Skipped: 1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (OPENNLP-1102) Universal Dependencies eval tests is failing

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reopened OPENNLP-1102:
-
  Assignee: Joern Kottmann

> Universal Dependencies eval tests is failing
> 
>
> Key: OPENNLP-1102
> URL: https://issues.apache.org/jira/browse/OPENNLP-1102
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
> Fix For: 1.8.1
>
>
> Results :
> Failed tests: 
>   UniversalDependency20Eval.trainAndEvalSpanishAncora:82 expected:<0.9046675934566091> but was:<0.9057341692068787>
> Tests in error: 
>   Conll00ChunkerEval.evalEnglishMaxentQn:95->train:55 » OutOfMemory Java heap sp...
> Tests run: 774, Failures: 1, Errors: 1, Skipped: 1





[jira] [Closed] (OPENNLP-1105) Add profile for eval tests that need a lot of memory

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1105.
---
Resolution: Fixed

> Add profile for eval tests that need a lot of memory
> 
>
> Key: OPENNLP-1105
> URL: https://issues.apache.org/jira/browse/OPENNLP-1105
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.1
>
>
> There are eval tests that need too much memory to be run together with the 
> other tests. They should be activated by a profile.





[jira] [Closed] (OPENNLP-1102) Universal Dependencies eval tests is failing

2017-06-30 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1102.
---
Resolution: Fixed

> Universal Dependencies eval tests is failing
> 
>
> Key: OPENNLP-1102
> URL: https://issues.apache.org/jira/browse/OPENNLP-1102
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
> Fix For: 1.8.1
>
>
> Results :
> Failed tests: 
>   UniversalDependency20Eval.trainAndEvalSpanishAncora:82 expected:<0.9046675934566091> but was:<0.9057341692068787>
> Tests in error: 
>   Conll00ChunkerEval.evalEnglishMaxentQn:95->train:55 » OutOfMemory Java heap sp...
> Tests run: 774, Failures: 1, Errors: 1, Skipped: 1





[jira] [Created] (OPENNLP-1106) Update the coref code to compile against 1.6.0

2017-06-30 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1106:
---

 Summary: Update the coref code to compile against 1.6.0
 Key: OPENNLP-1106
 URL: https://issues.apache.org/jira/browse/OPENNLP-1106
 Project: OpenNLP
  Issue Type: Improvement
  Components: Coref
Reporter: Joern Kottmann
Assignee: Joern Kottmann


It would be nice if the coref code compiled against an older release version 
and were updated a bit so that it mostly complies with the checkstyle rules.





[jira] [Created] (OPENNLP-1105) Add profile for eval tests that need a lot of memory

2017-06-30 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1105:
---

 Summary: Add profile for eval tests that need a lot of memory
 Key: OPENNLP-1105
 URL: https://issues.apache.org/jira/browse/OPENNLP-1105
 Project: OpenNLP
  Issue Type: Improvement
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.1


There are eval tests that need too much memory to be run together with the other 
tests. They should be activated by a profile.





[jira] [Created] (OPENNLP-1102) Universal Dependencies eval tests is failing

2017-06-29 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1102:
---

 Summary: Universal Dependencies eval tests is failing
 Key: OPENNLP-1102
 URL: https://issues.apache.org/jira/browse/OPENNLP-1102
 Project: OpenNLP
  Issue Type: Bug
  Components: Build, Packaging and Test
Reporter: Joern Kottmann
 Fix For: 1.8.1


Results :

Failed tests: 
  UniversalDependency20Eval.trainAndEvalSpanishAncora:82 
expected:<0.9046675934566091> but was:<0.9057341692068787>
Tests in error: 
  Conll00ChunkerEval.evalEnglishMaxentQn:95->train:55 » OutOfMemory Java heap 
sp...

Tests run: 774, Failures: 1, Errors: 1, Skipped: 1





[jira] [Closed] (OPENNLP-1017) OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException

2017-06-29 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1017.
---
Resolution: Cannot Reproduce

Assuming you train with enough data, the issue is probably that you only have 
one document. Come to the mailing list and tell us about your data.

The name finder format uses blank lines to mark document boundaries, and you 
should have many documents in your training data. A common mistake is to train 
with only one document; that is OK for training but not for our eval tools.
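
A quick way to check this is to count the blank-line-separated documents in the 
training file. The following is only a sketch in plain Java, not an OpenNLP 
API: the class and method names are hypothetical, and it assumes one sentence 
per line with one or more blank lines between documents.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of OpenNLP: counts documents in name finder training data. */
final class DocumentCounter {

  /** Counts runs of non-blank lines; each run is treated as one document. */
  static int countDocuments(List<String> lines) {
    int documents = 0;
    boolean insideDocument = false;
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        insideDocument = false;   // a blank line ends the current document
      } else if (!insideDocument) {
        insideDocument = true;    // first non-blank line starts a new document
        documents++;
      }
    }
    return documents;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "<START:person> Pierre Vinken <END> , 61 years old , will join the board .",
        "",
        "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .");
    System.out.println(countDocuments(lines) + " document(s)");
  }
}
```

If this reports a single document for a large file, the data probably lacks 
the blank lines that the cross validation needs to split it into partitions.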

Please reopen this in case there is some bug causing it.

> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException
> -
>
> Key: OPENNLP-1017
> URL: https://issues.apache.org/jira/browse/OPENNLP-1017
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Saurabh Jain
>
> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException.
> With nfold value 3, I tried to cross validate NameFinder training data. After 
> some research I found that the first partition is assigned null data.





[jira] [Updated] (OPENNLP-1017) OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException

2017-06-29 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1017:

Fix Version/s: (was: 1.8.1)

> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException
> -
>
> Key: OPENNLP-1017
> URL: https://issues.apache.org/jira/browse/OPENNLP-1017
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: Saurabh Jain
>
> OpenNlp NameFinderCrossValidation gives InsufficientTrainingDataException.
> With nfold value 3, I tried to cross validate NameFinder training data. After 
> some research I found that the first partition is assigned null data.





[jira] [Commented] (OPENNLP-1055) POSTagger.train causing error

2017-06-29 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069051#comment-16069051
 ] 

Joern Kottmann commented on OPENNLP-1055:
-

I would be curious to know more about the environment you are running that in. 
You should have the TrainUtil class as well in the opennlp-tools jar file.
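
To narrow this down, one option is to check at runtime whether the class named 
in the stack trace is visible at all. This is a minimal sketch using only the 
JDK; the helper class is hypothetical and not part of OpenNLP.

```java
/** Hypothetical diagnostic, not part of OpenNLP: reports whether a class can be loaded. */
final class ClasspathCheck {

  /** Returns true if the named class is visible to this class's class loader. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Class name taken from the stack trace quoted in this issue.
    System.out.println("opennlp.model.TrainUtil present: "
        + isOnClasspath("opennlp.model.TrainUtil"));
  }
}
```

Running it with the same classpath as the failing program shows whether the 
jar that contains the class is actually being picked up.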

> POSTagger.train causing error
> -
>
> Key: OPENNLP-1055
> URL: https://issues.apache.org/jira/browse/OPENNLP-1055
> Project: OpenNLP
>  Issue Type: Bug
>Reporter: zaheen mumtaz
>Assignee: Suneel Marthi
> Fix For: 1.8.1
>
>
> I am trying to create an Urdu POS tagger model. How do I train my ur-pos.bin 
> file from my training data? I have worked on it a lot but get an error at line 
>  
> model = POSTaggerME.train("en", sampleStream, 
> TrainingParameters.defaultParams(), factory);
>  
> The error is as follows:
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> opennlp/model/TrainUtil
>   at opennlp.tools.postag.POSTaggerME.train(POSTaggerME.java:332)
>   at 
> myproject.TaggerDictionaryTest.testTrainTaggerWithDictionary(TaggerDictionaryTest.java:98)
>   at myproject.TaggerDictionaryTest.main(TaggerDictionaryTest.java:40)
> Caused by: java.lang.ClassNotFoundException: opennlp.model.TrainUtil
>   at java.net.URLClassLoader.findClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   ... 3 more
> How can I resolve this error?





[jira] [Closed] (OPENNLP-1091) Fixing issues found via FindBugs and warnings found via IDE

2017-06-28 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1091.
---
Resolution: Fixed

> Fixing issues found via FindBugs and warnings found via IDE
> ---
>
> Key: OPENNLP-1091
> URL: https://issues.apache.org/jira/browse/OPENNLP-1091
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.0
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: findbugs, static-analysis, warnings
> Fix For: 1.8.1
>
>
> There are several issues that can be found using *FindBugs*.
> {noformat}
> mvn clean install findbugs:findbugs findbugs:gui
> {noformat}
> _opennlp-tools_ is the only project with issues. Some are merely cosmetic or 
> not very important. The pull request mentioned in this issue does not fix 
> all issues found, only the ones that I thought were more important and 
> that would not have a huge impact on the code (i.e. would not require 
> changing much of the current behaviour/code base).
> Some changes are quite useful, such as optimizations that replace string 
> concatenation and use _Map#entrySet_ instead of _Map#keySet_ plus another 
> call to _Map#get_. With all the optimization changes put together, I expect 
> we should see at least a few milliseconds of improvement.
> Other changes are quite important, such as comparisons with 
> _Object.equals(anArray, anotherArray)_, which compares the two objects with 
> _==_, meaning that even when the arrays have equal contents, it still 
> returns false.
> In the pull request, I intentionally did not squash the commits, as the 
> second commit includes warnings found via the IDE (Eclipse in this case, but 
> I believe they are independent of the IDE), such as _suppressWarnings_ 
> annotations that are not necessary and, most importantly, resource leaks.
> The latter issue was fixed with Java 8 try-with-resources, mainly in tests, 
> but also in some tools.
> Cheers
> Bruno
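
The array comparison problem described above can be illustrated with a small 
sketch. It is not code from the pull request; the class name is made up, and it 
only shows the general Java behaviour: arrays inherit Object#equals, which 
compares references, while java.util.Arrays.equals compares contents.

```java
import java.util.Arrays;

/** Illustration only: reference equality versus content equality for arrays. */
final class ArrayEqualityDemo {

  /** Arrays inherit Object#equals, which compares references, not contents. */
  static boolean sameReference(String[] a, String[] b) {
    return a.equals(b);
  }

  /** Arrays.equals compares the lengths and then the elements pairwise. */
  static boolean sameContents(String[] a, String[] b) {
    return Arrays.equals(a, b);
  }

  public static void main(String[] args) {
    String[] first = {"NN", "VB"};
    String[] second = {"NN", "VB"};
    System.out.println(sameReference(first, second)); // false: two distinct array objects
    System.out.println(sameContents(first, second));  // true: same elements in the same order
  }
}
```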





[jira] [Assigned] (OPENNLP-1092) PosTagger serialization in namefinder model

2017-06-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-1092:
---

Assignee: Joern Kottmann

> PosTagger serialization in namefinder model
> ---
>
> Key: OPENNLP-1092
> URL: https://issues.apache.org/jira/browse/OPENNLP-1092
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: 1.8.0, 1.8.1
> Environment: Ubuntu 16.04 - Intel Core i7 6700k - Openjdk version 
> 1.8.0_131
>Reporter: Damiano Porta
>Assignee: Joern Kottmann
> Fix For: 1.8.1
>
>
> I am getting an error during the serialization of the POS tagger inside a 
> name finder model.
> The error is: *java.lang.IllegalStateException: Missing serializer for 
> postagger.bin*
> I am having this problem via API and via cmd NameFinderTrainer tool.
> The command is:
> *opennlp TokenNameFinderTrainer -data /home/damiano/corpus.train -lang it 
> -model /home/damiano/model.bin -featuregen /home/damiano/test.xml 
> -sequenceCodec BIO -resources 
> /home/damiano/lavoro/java/Parser/src/main/resources/*
> The output is:
> {code}
> Writing name finder model ... Compressed 885605 parameters to 94030
> 3451 outcome patterns
> Exception in thread "main" java.lang.IllegalStateException: Missing 
> serializer for postagger.bin
>   at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:592)
>   at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182)
>   at 
> opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:188)
>   at opennlp.tools.cmdline.CLI.main(CLI.java:244)
> {code}
> My generators.xml is:
> {code:xml}
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  
> 
> 
> 
>  
> 
> 
> 
>   
> 
> 
> 
> {code}





[jira] [Closed] (OPENNLP-1096) Optimize n-gram creation loop for CPU cache usage

2017-06-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1096.
---
Resolution: Fixed

> Optimize n-gram creation loop for CPU cache usage
> -
>
> Key: OPENNLP-1096
> URL: https://issues.apache.org/jira/browse/OPENNLP-1096
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Trivial
> Fix For: 1.8.1
>
>
> Two for loops read the string and calculate the n-grams; the loop order 
> should be swapped to be more cache friendly.
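
One possible reading of this change, as a sketch rather than the actual OpenNLP 
code: the outer loop walks the text once and the inner loop varies the n-gram 
length, so all n-grams starting at a position are read while that part of the 
text is still in cache. The class and method names are hypothetical, and the 
original loop order (outer loop over lengths) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of character n-gram extraction with a position-major loop order. */
final class NGramSketch {

  /** Collects all character n-grams with minLength <= n <= maxLength. */
  static List<String> ngrams(CharSequence text, int minLength, int maxLength) {
    List<String> result = new ArrayList<>();
    // Outer loop: start position in the text (a single pass over the characters).
    for (int start = 0; start < text.length(); start++) {
      // Inner loop: n-gram length, stopping when the n-gram would run past the end.
      for (int n = minLength; n <= maxLength && start + n <= text.length(); n++) {
        result.add(text.subSequence(start, start + n).toString());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(ngrams("abc", 1, 2)); // [a, ab, b, bc, c]
  }
}
```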





[jira] [Updated] (OPENNLP-936) Add thread safe versions of some tools (ME sentence detection, tokenization, pos tagging)

2017-06-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-936:
---
Fix Version/s: (was: 1.8.1)
   1.8.2

> Add thread safe versions of some tools (ME sentence detection, tokenization, 
> pos tagging)
> -
>
> Key: OPENNLP-936
> URL: https://issues.apache.org/jira/browse/OPENNLP-936
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: POS Tagger
>Affects Versions: 1.7.1
>Reporter: Thilo Goetz
>Priority: Minor
> Fix For: 1.8.2
>
>
> As discussed on the mailing list, add thread safe versions of maximum entropy 
> sentence detection, tokenization and pos tagging.
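
One common way to get a thread safe variant of a tool that keeps mutable state 
is to give each thread its own instance through a ThreadLocal. The sketch below 
shows that pattern only; the Tagger interface and the wrapper class are 
hypothetical stand-ins, not OpenNLP types, and the issue does not say which 
approach will be used.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a tool that keeps mutable state and is not thread safe. */
interface Tagger {
  String[] tag(String[] tokens);
}

/** Thread safe wrapper: each thread lazily gets its own delegate instance. */
final class ThreadLocalTagger implements Tagger {

  private final ThreadLocal<Tagger> delegate;

  ThreadLocalTagger(Supplier<Tagger> factory) {
    // The factory is called at most once per thread, on that thread's first use.
    this.delegate = ThreadLocal.withInitial(factory);
  }

  @Override
  public String[] tag(String[] tokens) {
    return delegate.get().tag(tokens);
  }

  public static void main(String[] args) {
    // The inner lambda stands in for constructing a real, non-thread-safe tool.
    Tagger shared = new ThreadLocalTagger(() -> tokens -> new String[tokens.length]);
    System.out.println(shared.tag(new String[] {"a", "b"}).length);
  }
}
```

The trade-off is memory: every thread that uses the wrapper holds its own 
delegate, although the expensive model object can still be shared if the 
factory captures it.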





[jira] [Updated] (OPENNLP-1091) Fixing issues found via FindBugs and warnings found via IDE

2017-06-26 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1091:

Fix Version/s: 1.8.1

> Fixing issues found via FindBugs and warnings found via IDE
> ---
>
> Key: OPENNLP-1091
> URL: https://issues.apache.org/jira/browse/OPENNLP-1091
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 1.8.0
>Reporter: Bruno P. Kinoshita
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: findbugs, static-analysis, warnings
> Fix For: 1.8.1
>
>
> There are several issues that can be found using *FindBugs*.
> {noformat}
> mvn clean install findbugs:findbugs findbugs:gui
> {noformat}
> _opennlp-tools_ is the only project with issues. Some are merely cosmetic or 
> not very important. The pull request mentioned in this issue does not fix 
> all issues found, only the ones that I thought were more important and 
> that would not have a huge impact on the code (i.e. would not require 
> changing much of the current behaviour/code base).
> Some changes are quite useful, such as optimizations that replace string 
> concatenation and use _Map#entrySet_ instead of _Map#keySet_ plus another 
> call to _Map#get_. With all the optimization changes put together, I expect 
> we should see at least a few milliseconds of improvement.
> Other changes are quite important, such as comparisons with 
> _Object.equals(anArray, anotherArray)_, which compares the two objects with 
> _==_, meaning that even when the arrays have equal contents, it still 
> returns false.
> In the pull request, I intentionally did not squash the commits, as the 
> second commit includes warnings found via the IDE (Eclipse in this case, but 
> I believe they are independent of the IDE), such as _suppressWarnings_ 
> annotations that are not necessary and, most importantly, resource leaks.
> The latter issue was fixed with Java 8 try-with-resources, mainly in tests, 
> but also in some tools.
> Cheers
> Bruno
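
The _Map#entrySet_ optimization mentioned above follows a general Java idiom, 
sketched here with a made-up class; it is not taken from the pull request. 
Iterating over the entries yields key and value together, so no additional 
_Map#get_ lookup is needed for each key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: iterating a map by entry instead of by key plus lookup. */
final class EntrySetDemo {

  /** Sums all values with one traversal and no per-key lookup. */
  static int total(Map<String, Integer> counts) {
    int sum = 0;
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      sum += entry.getValue();
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("NN", 3);
    counts.put("VB", 2);
    System.out.println(total(counts)); // 5
  }
}
```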





[jira] [Closed] (OPENNLP-1097) Enable language detector normalizers by default

2017-06-22 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1097.
---
Resolution: Fixed

> Enable language detector normalizers by default
> ---
>
> Key: OPENNLP-1097
> URL: https://issues.apache.org/jira/browse/OPENNLP-1097
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
> Fix For: 1.8.1
>
>
> The normalizers should be used by default in the langdetect context generator.





[jira] [Created] (OPENNLP-1097) Enable language detector normalizers by default

2017-06-22 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1097:
---

 Summary: Enable language detector normalizers by default
 Key: OPENNLP-1097
 URL: https://issues.apache.org/jira/browse/OPENNLP-1097
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Minor
 Fix For: 1.8.1


The normalizers should be used by default in the langdetect context generator.





[jira] [Created] (OPENNLP-1096) Optimize n-gram creation loop for CPU cache usage

2017-06-22 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-1096:
---

 Summary: Optimize n-gram creation loop for CPU cache usage
 Key: OPENNLP-1096
 URL: https://issues.apache.org/jira/browse/OPENNLP-1096
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Assignee: Joern Kottmann
Priority: Trivial
 Fix For: 1.8.1


Two for loops read the string and calculate the n-grams; the loop order 
should be swapped to be more cache friendly.





[jira] [Closed] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-06-22 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-1082.
---
Resolution: Fixed

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.1
>
>






[jira] [Updated] (OPENNLP-1080) TwoPassDataIndexer should use binary format

2017-06-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1080:

Fix Version/s: (was: 1.8.1)

> TwoPassDataIndexer should use binary format
> ---
>
> Key: OPENNLP-1080
> URL: https://issues.apache.org/jira/browse/OPENNLP-1080
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Minor
>
> The TwoPassDataIndexer should use a binary format instead of the current 
> string format to improve read/write performance of the events. Furthermore, 
> it should use a checksum to guarantee that the data read back from disk is 
> the same as what was written.
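
As a rough illustration of the idea, the sketch below writes one event in a 
binary layout and appends a CRC32 checksum that is verified on read. It uses 
only JDK classes; the class name, the record layout (outcome, feature count, 
features, checksum) and the choice of CRC32 are assumptions for this example, 
not the format that TwoPassDataIndexer uses or will use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

/** Hypothetical codec: binary event record followed by a CRC32 of the record bytes. */
final class BinaryEventCodec {

  /** Serializes the outcome and its context features, then appends the checksum. */
  static byte[] write(String outcome, String[] context) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    CheckedOutputStream checked = new CheckedOutputStream(bytes, new CRC32());
    DataOutputStream out = new DataOutputStream(checked);
    out.writeUTF(outcome);
    out.writeInt(context.length);
    for (String feature : context) {
      out.writeUTF(feature);
    }
    out.flush();
    // Capture the checksum of the record before the checksum itself is written.
    long crc = checked.getChecksum().getValue();
    out.writeLong(crc);
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns the outcome at index 0 followed by the features; fails on checksum mismatch. */
  static String[] read(byte[] data) throws IOException {
    CheckedInputStream checked =
        new CheckedInputStream(new ByteArrayInputStream(data), new CRC32());
    DataInputStream in = new DataInputStream(checked);
    String outcome = in.readUTF();
    String[] event = new String[in.readInt() + 1];
    event[0] = outcome;
    for (int i = 1; i < event.length; i++) {
      event[i] = in.readUTF();
    }
    // Checksum of the record bytes read so far, taken before reading the stored value.
    long actual = checked.getChecksum().getValue();
    long expected = in.readLong();
    if (actual != expected) {
      throw new IOException("Checksum mismatch: event data was corrupted");
    }
    return event;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = write("en", new String[] {"ab", "bc"});
    System.out.println(String.join(" ", read(data)));
  }
}
```

A real indexer would stream many events to a temporary file and would probably 
keep one checksum for the whole file rather than one per event.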





[jira] [Updated] (OPENNLP-1089) Reduce printing of iterations in JUnit tests

2017-06-21 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann updated OPENNLP-1089:

Fix Version/s: (was: 1.8.1)

> Reduce printing of iterations in JUnit tests
> 
>
> Key: OPENNLP-1089
> URL: https://issues.apache.org/jira/browse/OPENNLP-1089
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Build, Packaging and Test
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Trivial
>
> The tests now run in parallel and a lot of iterations are printed totally 
> mixed up, making this kind of output useless. There is also not much value in 
> printing all the iterations to the console.
> Let's reduce the number of iterations for training tests.




