[jira] [Updated] (OPENNLP-1314) ConlluWordLine to print line contents when throwing format error

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1314:

Fix Version/s: 2.1.0

> ConlluWordLine to print line contents when throwing format error
> 
>
> Key: OPENNLP-1314
> URL: https://issues.apache.org/jira/browse/OPENNLP-1314
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: 1.9.3
>Reporter: Markus Jelsma
>Assignee: Martin Wiesner
>Priority: Trivial
> Fix For: 2.1.0
>
> Attachments: OPENNLP-1314.patch
>
>
> Exception thrown for edit/formatting errors is not helpful in debugging. This 
> tiny patch makes my day much easier.
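
To illustrate the kind of improvement requested above, the sketch below parses one CoNLL-U word line and puts the offending line into the exception message. The class name `WordLineSketch`, the method `parseFields`, and the use of `IllegalArgumentException` are assumptions made for this example. It is not the code from OPENNLP-1314.patch, nor OpenNLP's actual `ConlluWordLine`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not OpenNLP's ConlluWordLine.
final class WordLineSketch {

  // A CoNLL-U word line consists of 10 tab-separated fields.
  static final int EXPECTED_FIELDS = 10;

  static List<String> parseFields(String line) {
    // The -1 limit keeps trailing empty fields, so they still count.
    String[] fields = line.split("\t", -1);
    if (fields.length != EXPECTED_FIELDS) {
      // Quoting the raw line shows the reader which corpus line is broken.
      throw new IllegalArgumentException("Line must have exactly " + EXPECTED_FIELDS
          + " fields, but found " + fields.length + ": '" + line + "'");
    }
    return Arrays.asList(fields);
  }

  public static void main(String[] args) {
    try {
      // Made-up, truncated word line with only 4 fields.
      parseFields("1\tThey\tthey\tPRON");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With the line contents in the message, a malformed entry can be located by searching the corpus file for the quoted text.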



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-1314) ConlluWordLine to print line contents when throwing format error

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1314.
-
  Assignee: Martin Wiesner
Resolution: Fixed

This was resolved by commit 8069a734 in the context of OPENNLP-1372.

Closing as duplicate.

> ConlluWordLine to print line contents when throwing format error
> 
>
> Key: OPENNLP-1314
> URL: https://issues.apache.org/jira/browse/OPENNLP-1314
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: 1.9.3
>Reporter: Markus Jelsma
>Assignee: Martin Wiesner
>Priority: Trivial
> Attachments: OPENNLP-1314.patch
>
>
> Exception thrown for edit/formatting errors is not helpful in debugging. This 
> tiny patch makes my day much easier.





[jira] [Updated] (OPENNLP-731) SAMpLes

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-731:
---
Issue Type: Wish  (was: Improvement)

> SAMpLes
> ---
>
> Key: OPENNLP-731
> URL: https://issues.apache.org/jira/browse/OPENNLP-731
> Project: OpenNLP
>  Issue Type: Wish
> Environment: Java on Linux OS
>Reporter: Giuseppe Laurenza
>Priority: Minor
>  Labels: features
>
> Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that 
> provides algorithms for sentiment analysis, rating phrases on a scale from 
> 1.0 to 5.0. What distinguishes this engine is that it uses "Online-Reviews 
> with a vote" to build the dictionary. It comes with two complete dictionaries, 
> for English and Italian, and with functions to generate a 
> custom dictionary from a personal dataset.





[jira] [Updated] (OPENNLP-731) Integrate SAMpLes engine

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-731:
---
Summary: Integrate SAMpLes engine  (was: Integrate SAMpLes)

> Integrate SAMpLes engine
> 
>
> Key: OPENNLP-731
> URL: https://issues.apache.org/jira/browse/OPENNLP-731
> Project: OpenNLP
>  Issue Type: Wish
> Environment: Java on Linux OS
>Reporter: Giuseppe Laurenza
>Priority: Minor
>  Labels: features
>
> Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that 
> provides algorithms for sentiment analysis, rating phrases on a scale from 
> 1.0 to 5.0. What distinguishes this engine is that it uses "Online-Reviews 
> with a vote" to build the dictionary. It comes with two complete dictionaries, 
> for English and Italian, and with functions to generate a 
> custom dictionary from a personal dataset.





[jira] [Updated] (OPENNLP-731) Integrate SAMpLes

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-731:
---
Summary: Integrate SAMpLes  (was: SAMpLes)

> Integrate SAMpLes
> -
>
> Key: OPENNLP-731
> URL: https://issues.apache.org/jira/browse/OPENNLP-731
> Project: OpenNLP
>  Issue Type: Wish
> Environment: Java on Linux OS
>Reporter: Giuseppe Laurenza
>Priority: Minor
>  Labels: features
>
> Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that 
> provides algorithms for sentiment analysis, rating phrases on a scale from 
> 1.0 to 5.0. What distinguishes this engine is that it uses "Online-Reviews 
> with a vote" to build the dictionary. It comes with two complete dictionaries, 
> for English and Italian, and with functions to generate a 
> custom dictionary from a personal dataset.





[jira] [Resolved] (OPENNLP-504) Add a FAQ page to our site

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-504.

Resolution: Fixed

The FAQ is available at [https://opennlp.apache.org/faq.html] and has the 
content that was proposed and discussed.

> Add a FAQ page to our site
> --
>
> Key: OPENNLP-504
> URL: https://issues.apache.org/jira/browse/OPENNLP-504
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Website
>Reporter: James Kosin
>Assignee: Bruno P. Kinoshita
>Priority: Minor
>  Labels: FAQ, newbie
> Attachments: opennlp-faq-wip-20170512-fullpage.png
>
>
> Collect and assemble a FAQ page for our site.
> Most questions start out:
>   Where can I get the models?
>   Where do I start getting to know OpenNLP?
>   etc.





[jira] [Resolved] (OPENNLP-1345) The Training API code for Sentence Detection is outdated in manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1345.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

This is fixed with commit 6c4dc364 in the context of OPENNLP-1362.

Since this change, the documentation is consistent with the API again.

> The Training API code for Sentence Detection is outdated in manual
> --
>
> Key: OPENNLP-1345
> URL: https://issues.apache.org/jira/browse/OPENNLP-1345
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.4
>Reporter: Phillip Rhodes
>Priority: Minor
>  Labels: documentation, easy-fix
> Fix For: 2.0.0
>
>
> The Training API example code at 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section on 
> Sentence Detection training is incorrect. The current code sample is:
> {code:java}
> ObjectStream lineStream =
>   new PlainTextByLineStream(new FileInputStream("en-sent.train"), 
> StandardCharsets.UTF_8);
>  {code}
> But PlainTextByLineStream no longer takes an InputStream as the first 
> argument to its constructor. It now requires an InputStreamFactory.
> NOTE: this same pattern reappears in multiple places in the current manual. 
> See also, OPENNLP-1319
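
The constructor change described above can be pictured with a small stand-alone sketch. The nested types `StreamFactory` and `LineStream` are hypothetical stand-ins that only mimic the shape of OpenNLP's `InputStreamFactory` and `PlainTextByLineStream`; they are not the library classes. One plausible benefit of a factory is shown here: the stream can be rewound by re-opening its source, which a bare `InputStream` does not generally allow. With OpenNLP on the classpath, the corrected manual line would presumably look like `new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-sent.train")), StandardCharsets.UTF_8)`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Stand-alone sketch; the nested types only mimic the shape of the OpenNLP ones.
final class FactoryStreamSketch {

  // Hypothetical stand-in for opennlp.tools.util.InputStreamFactory.
  interface StreamFactory {
    InputStream createInputStream() throws IOException;
  }

  // Hypothetical line stream that can be rewound by re-opening its source.
  static final class LineStream implements AutoCloseable {
    private final StreamFactory factory;
    private final Charset charset;
    private BufferedReader reader;

    LineStream(StreamFactory factory, Charset charset) throws IOException {
      this.factory = factory;
      this.charset = charset;
      reset();
    }

    // Returns the next line, or null at the end of the input.
    String read() throws IOException {
      return reader.readLine();
    }

    // Closes the current reader (if any) and opens a fresh stream from the factory.
    void reset() throws IOException {
      if (reader != null) {
        reader.close();
      }
      reader = new BufferedReader(new InputStreamReader(factory.createInputStream(), charset));
    }

    @Override
    public void close() throws IOException {
      reader.close();
    }
  }

  // Demo helper: reads the first line, rewinds, and reads the first line again.
  static String[] readFirstLineTwice(String text) throws IOException {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    StreamFactory factory = () -> new ByteArrayInputStream(bytes);
    try (LineStream lines = new LineStream(factory, StandardCharsets.UTF_8)) {
      String first = lines.read();
      lines.reset();
      return new String[] {first, lines.read()};
    }
  }

  public static void main(String[] args) throws IOException {
    String[] result = readFirstLineTwice("First sentence.\nSecond sentence.\n");
    System.out.println(result[0] + " | " + result[1]);
  }
}
```

The in-memory byte array keeps the sketch runnable without a training file; a file-backed factory would follow the same pattern.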





[jira] [Resolved] (OPENNLP-1346) The Training API code for Tokenization is outdated in manual (1/2)

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1346.
-
Fix Version/s: 2.0.0
 Assignee: Martin Wiesner
   Resolution: Fixed

This is fixed with commit 6c4dc364 in the context of OPENNLP-1362.

Since this change, the documentation is consistent with the API again.

> The Training API code for Tokenization is outdated in manual (1/2)
> --
>
> Key: OPENNLP-1346
> URL: https://issues.apache.org/jira/browse/OPENNLP-1346
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.4
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easy-fix
> Fix For: 2.0.0
>
>
> The Training API example code at 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section 
> dealing with Tokenizer training is incorrect. The current code sample is:
> {code:java}
> ObjectStream lineStream = new PlainTextByLineStream(new 
> FileInputStream("en-sent.train"),
> StandardCharsets.UTF_8);{code}
> But PlainTextByLineStream no longer takes an InputStream as the first 
> argument to its constructor. It now requires an InputStreamFactory.
> NOTE: this same pattern reappears in multiple places in the current manual. 
> See also, OPENNLP-1319 and OPENNLP-1345
>  





[jira] [Resolved] (OPENNLP-1349) The Training API code for Document Categorization is outdated in manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1349.
-
Fix Version/s: 2.0.0
 Assignee: Martin Wiesner
   Resolution: Fixed

This is fixed with commit 6c4dc364 in the context of OPENNLP-1362.

Since the change, the documentation is consistent with the API again.

> The Training API code for Document Categorization is outdated in manual
> ---
>
> Key: OPENNLP-1349
> URL: https://issues.apache.org/jira/browse/OPENNLP-1349
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.4
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easy-fix
> Fix For: 2.0.0
>
>
> The Training API example code at 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section 
> dealing with TokenNameFinder training is incorrect. The current code sample 
> includes:
> {code:java}
> try (dataIn = new FileInputStream("en-sentiment.train")) {
>   ObjectStream lineStream =
>   new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
> }{code}
> But PlainTextByLineStream no longer takes an InputStream as the first 
> argument to its constructor. It now requires an InputStreamFactory.
> NOTE: this same pattern reappears in multiple places in the current manual. 
> See also, OPENNLP-1319,  OPENNLP-1345, and OPENNLP-1346 among others.





[jira] [Resolved] (OPENNLP-1348) The Training API code for NamedEntityRecognition is outdated in manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1348.
-
Fix Version/s: 2.0.0
 Assignee: Martin Wiesner
   Resolution: Fixed

This is fixed with commit 6c4dc364 in the context of OPENNLP-1362.

Since the change, the documentation is consistent with the API again.

> The Training API code for NamedEntityRecognition is outdated in manual
> --
>
> Key: OPENNLP-1348
> URL: https://issues.apache.org/jira/browse/OPENNLP-1348
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.4
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easy-fix
> Fix For: 2.0.0
>
>
> The Training API example code at 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section 
> dealing with TokenNameFinder training is incorrect. The current code sample 
> includes:
> {code:java}
> ObjectStream lineStream = new PlainTextByLineStream(new 
> FileInputStream("en-sent.train"),
> StandardCharsets.UTF_8);{code}
> But PlainTextByLineStream no longer takes an InputStream as the first 
> argument to its constructor. It now requires an InputStreamFactory.
> NOTE: this same pattern reappears in multiple places in the current manual. 
> See also OPENNLP-1319, OPENNLP-1345, and OPENNLP-1346.
>  





[jira] [Resolved] (OPENNLP-1347) The Training API code for Tokenization is outdated in manual (2/2)

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1347.
-
Fix Version/s: 2.0.0
 Assignee: Martin Wiesner
   Resolution: Fixed

This is fixed with commit 6c439ae9 in the context of OPENNLP-1319.

Since the change, the documentation is consistent with the API again.

> The Training API code for Tokenization is outdated in manual (2/2)
> --
>
> Key: OPENNLP-1347
> URL: https://issues.apache.org/jira/browse/OPENNLP-1347
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.4
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easy-fix
> Fix For: 2.0.0
>
>
> The code sample in the manual at <> in the section on Tokenizer training 
> is incorrect. The current code sample is:
>  
> {code:java}
> try {
>   model = TokenizerME.train("en", sampleStream, true, 
> TrainingParameters.defaultParams());
> } {code}
> But TokenizerME.train() now has a new signature which requires a 
> TokenizerFactory. The above does not compile with the 1.9.4 library version.





[jira] [Assigned] (OPENNLP-1357) Use CharSequence to allow for memory management

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1357:
---

Assignee: Martin Wiesner

> Use CharSequence to allow for memory management
> ---
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Affects Versions: 1.9.4
>Reporter: Paul Austin
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> Most classes in OpenNLP require their inputs to be a String, StringBuffer, 
> or char[]. This means that you have to load all the data into memory.
> Many of these cases (String and StringBuffer args) could be replaced with a 
> single method that accepts a CharSequence parameter.
> For example, in DefaultEndOfSentenceScanner:
>  
> {code:java}
> // Collects the index of every end-of-sentence character in s.
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets and, 
> in some cases, would require fewer temporary conversions to char buffers.
> Some code, such as SDContextGenerator, already uses CharSequence. However, 
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer: 
> sb isn't modified, SDContextGenerator.getContext takes a CharSequence 
> argument, and String is a CharSequence.
>  
> {code:java}
> public Span[] sentPosDetect(String s) {
>     sentProbs.clear();
>     StringBuffer sb = new StringBuffer(s);{code}
>  
> I can create a pull request(s) for the above if you think it is useful.
>  
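
As a rough illustration of the proposal above, the following stand-alone sketch shows one method that accepts `CharSequence` and is called with a `String`, a `StringBuilder`, and a `CharBuffer` without any conversion. The class `CharSequenceScanSketch` and its fixed set of end-of-sentence characters are assumptions for this example; it is not OpenNLP's `DefaultEndOfSentenceScanner`.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone scanner, not OpenNLP's DefaultEndOfSentenceScanner.
final class CharSequenceScanSketch {

  // Assumed end-of-sentence characters for this demo.
  private static final String EOS_CHARS = ".!?";

  // Accepts any CharSequence, so the caller chooses the backing storage.
  static List<Integer> getPositions(CharSequence s) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (EOS_CHARS.indexOf(s.charAt(i)) >= 0) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    String text = "Hi. Ok?";
    // The same method serves three different CharSequence implementations.
    System.out.println(getPositions(text));
    System.out.println(getPositions(new StringBuilder(text)));
    System.out.println(getPositions(CharBuffer.wrap(text.toCharArray())));
  }
}
```

Because `CharBuffer` is also a `CharSequence`, a caller could in principle pass a buffer backed by a memory-mapped file instead of copying the text into a `String` first.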





[jira] [Commented] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645383#comment-17645383
 ] 

ASF GitHub Bot commented on OPENNLP-1344:
-

mawiesne opened a new pull request, #454:
URL: https://github.com/apache/opennlp/pull/454

   Change
   -
   - adjusts URL to Leipzig corpora in corresponding `langdetect.xml`, as 
proposed by reporter 'P. Rhodes'
   - verified the new URL works as expected
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Broken link (404) for Leipzig corpora in OpenNLP Manual
> ---
>
> Key: OPENNLP-1344
> URL: https://issues.apache.org/jira/browse/OPENNLP-1344
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easyfix
> Fix For: 2.1.1
>
>
> In the User Manual at:
> https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html
> The download link for the Leipzig Corpora is listed as:
> https://corpora.uni-leipzig.de/download.html
> however this link returns a 404 Not Found error. The correct link now appears 
> to be:
> [https://wortschatz.uni-leipzig.de/en/download]





[jira] [Updated] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1344:

Fix Version/s: 2.1.1

> Broken link (404) for Leipzig corpora in OpenNLP Manual
> ---
>
> Key: OPENNLP-1344
> URL: https://issues.apache.org/jira/browse/OPENNLP-1344
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Phillip Rhodes
>Priority: Minor
>  Labels: documentation, easyfix
> Fix For: 2.1.1
>
>
> In the User Manual at:
> https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html
> The download link for the Leipzig Corpora is listed as:
> https://corpora.uni-leipzig.de/download.html
> however this link returns a 404 Not Found error. The correct link now appears 
> to be:
> [https://wortschatz.uni-leipzig.de/en/download]





[jira] [Assigned] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1344:
---

Assignee: Martin Wiesner

> Broken link (404) for Leipzig corpora in OpenNLP Manual
> ---
>
> Key: OPENNLP-1344
> URL: https://issues.apache.org/jira/browse/OPENNLP-1344
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Phillip Rhodes
>Assignee: Martin Wiesner
>Priority: Minor
>  Labels: documentation, easyfix
> Fix For: 2.1.1
>
>
> In the User Manual at:
> https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html
> The download link for the Leipzig Corpora is listed as:
> https://corpora.uni-leipzig.de/download.html
> however this link returns a 404 Not Found error. The correct link now appears 
> to be:
> [https://wortschatz.uni-leipzig.de/en/download]





[jira] [Updated] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1344:

Issue Type: Documentation  (was: Bug)

> Broken link (404) for Leipzig corpora in OpenNLP Manual
> ---
>
> Key: OPENNLP-1344
> URL: https://issues.apache.org/jira/browse/OPENNLP-1344
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Phillip Rhodes
>Priority: Minor
>  Labels: documentation, easyfix
>
> In the User Manual at:
> https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html
> The download link for the Leipzig Corpora is listed as:
> https://corpora.uni-leipzig.de/download.html
> however this link returns a 404 Not Found error. The correct link now appears 
> to be:
> [https://wortschatz.uni-leipzig.de/en/download]





[jira] [Updated] (OPENNLP-1357) Use CharSequence to allow for memory management

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1357:

Issue Type: Improvement  (was: New Feature)

> Use CharSequence to allow for memory management
> ---
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Affects Versions: 1.9.4
>Reporter: Paul Austin
>Priority: Minor
> Fix For: 2.1.1
>
>
> Most classes in OpenNLP require their inputs to be a String, StringBuffer, 
> or char[]. This means that you have to load all the data into memory.
> Many of these cases (String and StringBuffer args) could be replaced with a 
> single method that accepts a CharSequence parameter.
> For example, in DefaultEndOfSentenceScanner:
>  
> {code:java}
> // Collects the index of every end-of-sentence character in s.
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets and, 
> in some cases, would require fewer temporary conversions to char buffers.
> Some code, such as SDContextGenerator, already uses CharSequence. However, 
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer: 
> sb isn't modified, SDContextGenerator.getContext takes a CharSequence 
> argument, and String is a CharSequence.
>  
> {code:java}
> public Span[] sentPosDetect(String s) {
>     sentProbs.clear();
>     StringBuffer sb = new StringBuffer(s);{code}
>  
> I can create a pull request(s) for the above if you think it is useful.
>  





[jira] [Updated] (OPENNLP-1357) Use CharSequence to allow for memory management

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1357:

Fix Version/s: 2.1.1

> Use CharSequence to allow for memory management
> ---
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
>  Issue Type: New Feature
>  Components: Sentence Detector
>Affects Versions: 1.9.4
>Reporter: Paul Austin
>Priority: Minor
> Fix For: 2.1.1
>
>
> Most classes in OpenNLP require their inputs to be a String, StringBuffer, 
> or char[]. This means that you have to load all the data into memory.
> Many of these cases (String and StringBuffer args) could be replaced with a 
> single method that accepts a CharSequence parameter.
> For example, in DefaultEndOfSentenceScanner:
>  
> {code:java}
> // Collects the index of every end-of-sentence character in s.
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets and, 
> in some cases, would require fewer temporary conversions to char buffers.
> Some code, such as SDContextGenerator, already uses CharSequence. However, 
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer: 
> sb isn't modified, SDContextGenerator.getContext takes a CharSequence 
> argument, and String is a CharSequence.
>  
> {code:java}
> public Span[] sentPosDetect(String s) {
>     sentProbs.clear();
>     StringBuffer sb = new StringBuffer(s);{code}
>  
> I can create a pull request(s) for the above if you think it is useful.
>  





[jira] [Commented] (OPENNLP-1357) Use CharSequence to allow for memory management

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645376#comment-17645376
 ] 

ASF GitHub Bot commented on OPENNLP-1357:
-

mawiesne opened a new pull request, #453:
URL: https://github.com/apache/opennlp/pull/453

   Change
   -
   - adjusts method signatures in `SentenceDetector` and `EndOfSentenceScanner` 
to use `CharSequence` as proposed by reporter 'P. Austin'
   - adapts existing impl classes to work (fine) with this change, see comments 
in OPENNLP-1357
   - adjusts JavaDoc accordingly
   - adds 'Override' annotations in some spots where they were missing
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Use CharSequence to allow for memory management
> ---
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
>  Issue Type: New Feature
>  Components: Sentence Detector
>Affects Versions: 1.9.4
>Reporter: Paul Austin
>Priority: Minor
>
> Most classes in OpenNLP require their inputs to be a String, StringBuffer, 
> or char[]. This means that you have to load all the data into memory.
> Many of these cases (String and StringBuffer args) could be replaced with a 
> single method that accepts a CharSequence parameter.
> For example, in DefaultEndOfSentenceScanner:
>  
> {code:java}
> // Collects the index of every end-of-sentence character in s.
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets and, 
> in some cases, would require fewer temporary conversions to char buffers.
> Some code, such as SDContextGenerator, already uses CharSequence. However, 
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer: 
> sb isn't modified, SDContextGenerator.getContext takes a CharSequence 
> argument, and String is a CharSequence.
>  
> {code:java}
> public Span[] sentPosDetect(String s) {
>     sentProbs.clear();
>     StringBuffer sb = new StringBuffer(s);{code}
>  
> I can create a pull request(s) for the above if you think it is useful.
>  





[jira] [Updated] (OPENNLP-1363) Verify the documentation of the lemmatizer input format

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1363:

Issue Type: Documentation  (was: Task)

> Verify the documentation of the lemmatizer input format
> ---
>
> Key: OPENNLP-1363
> URL: https://issues.apache.org/jira/browse/OPENNLP-1363
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Jeff Zemerick
>Priority: Minor
>
> In OPENNLP-1257, a change was proposed to update the code to split the 
> lemmatizer input by spaces instead of by tab. I believe tab is the desired 
> delimiter but we need to make sure the documentation is consistent.
> Refer to 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer], 
> in particular the following sentences:
> "The training data consist of three columns separated by spaces. Each word 
> has been put on a separate line and there is an empty line after each 
> sentence. The first column contains the current word, the second its 
> part-of-speech tag and the third its lemma. Here is an example of the file 
> format:"
> Determine if that first line should read "separated by tabs" instead.
>  
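
To make the delimiter question above concrete, the sketch below splits one made-up three-column training line (word, part-of-speech tag, lemma) first on tabs and then on whitespace. The class `LemmaLineSketch`, its two methods, and the sample line are hypothetical and not taken from OpenNLP or its manual; the point is only that the two delimiters disagree as soon as a column contains a space.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP.
final class LemmaLineSketch {

  // Splits on tabs only, so a column may itself contain spaces.
  static String[] splitByTab(String line) {
    return line.split("\t");
  }

  // Splits on any run of whitespace, which also breaks up columns that contain spaces.
  static String[] splitByWhitespace(String line) {
    return line.split("\\s+");
  }

  public static void main(String[] args) {
    // Made-up training line: word, part-of-speech tag, lemma.
    String line = "New York\tNNP\tNew York";
    System.out.println(Arrays.toString(splitByTab(line)));
    System.out.println(Arrays.toString(splitByWhitespace(line)));
  }
}
```

Whichever delimiter the code settles on, the manual's description of the column separator should name the same one.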





[jira] [Updated] (OPENNLP-1363) Verify the documentation of the lemmatizer input format

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1363:

Affects Version/s: 2.1.0

> Verify the documentation of the lemmatizer input format
> ---
>
> Key: OPENNLP-1363
> URL: https://issues.apache.org/jira/browse/OPENNLP-1363
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Jeff Zemerick
>Priority: Minor
>
> In OPENNLP-1257, a change was proposed to update the code to split the 
> lemmatizer input by spaces instead of by tab. I believe tab is the desired 
> delimiter but we need to make sure the documentation is consistent.
> Refer to 
> [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer], 
> in particular the following sentences:
> "The training data consist of three columns separated by spaces. Each word 
> has been put on a separate line and there is an empty line after each 
> sentence. The first column contains the current word, the second its 
> part-of-speech tag and the third its lemma. Here is an example of the file 
> format:"
> Determine if that first line should read "separated by tabs" instead.
>  
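
A minimal sketch of why the delimiter question matters, assuming the three-column layout (word, part-of-speech tag, lemma) and the empty line between sentences described in the quoted manual text. The names `LemmaRow`, `LemmatizerFormatSketch`, `parseLine` and `parseSentences`, as well as the sample rows, are made up for illustration and are not part of the OpenNLP API. The delimiter is passed in as a regular expression so that both candidates (tab and spaces) can be tried against the same data.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of lemmatizer training data: word, part-of-speech tag, lemma. */
final class LemmaRow {
  final String word;
  final String posTag;
  final String lemma;

  LemmaRow(String word, String posTag, String lemma) {
    this.word = word;
    this.posTag = posTag;
    this.lemma = lemma;
  }
}

/** Hypothetical helper, not the OpenNLP implementation. */
final class LemmatizerFormatSketch {

  /** Splits one non-empty line into exactly three columns using the given delimiter regex. */
  static LemmaRow parseLine(String line, String delimiterRegex) {
    String[] columns = line.split(delimiterRegex);
    if (columns.length != 3) {
      throw new IllegalArgumentException(
          "Expected 3 columns but found " + columns.length + " in line: '" + line + "'");
    }
    return new LemmaRow(columns[0], columns[1], columns[2]);
  }

  /** Groups rows into sentences; an empty line closes the current sentence. */
  static List<List<LemmaRow>> parseSentences(List<String> lines, String delimiterRegex) {
    List<List<LemmaRow>> sentences = new ArrayList<>();
    List<LemmaRow> current = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
      } else {
        current.add(parseLine(line, delimiterRegex));
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<>();
    lines.add("He\tPRP\the");          // illustrative, tab-separated rows
    lines.add("reckons\tVBZ\treckon");
    lines.add("");                     // empty line ends the sentence
    List<List<LemmaRow>> sentences = parseSentences(lines, "\t");
    System.out.println(sentences.size() + " sentence(s) parsed");
  }
}
```

With tab-separated rows, calling `parseLine` with a space delimiter such as `" +"` finds a single column and fails, which is the kind of mismatch between code and documentation this ticket asks to rule out.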





[jira] [Updated] (OPENNLP-1135) Remove support for OSGi

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1135:

Issue Type: Wish  (was: Improvement)

> Remove support for OSGi
> ---
>
> Key: OPENNLP-1135
> URL: https://issues.apache.org/jira/browse/OPENNLP-1135
> Project: OpenNLP
>  Issue Type: Wish
>Reporter: Jörn Kottmann
>Priority: Minor
>
> Remove the OSGi bundle support from the opennlp-tools jar. OSGi isn't used 
> widely and the ones who are using it know how to use opennlp-tools in an OSGi 
> environment anyway by applying some build tricks.





[jira] [Updated] (OPENNLP-705) integrate Similarity with VerbNet

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-705:
---
Issue Type: Wish  (was: Bug)

> integrate Similarity with VerbNet
> -
>
> Key: OPENNLP-705
> URL: https://issues.apache.org/jira/browse/OPENNLP-705
> Project: OpenNLP
>  Issue Type: Wish
>Reporter: Boris Galitsky
>Assignee: Boris Galitsky
>Priority: Major
>
> When matching two parse trees, features of the verbs other than POS needs to 
> be taken into account for more accurate matching





[jira] [Updated] (OPENNLP-509) opennlp.tools.parser.Parse.getParent() returning incorrect object

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-509:
---
Issue Type: Bug  (was: Improvement)

> opennlp.tools.parser.Parse.getParent() returning incorrect object
> -
>
> Key: OPENNLP-509
> URL: https://issues.apache.org/jira/browse/OPENNLP-509
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Parser
>Affects Versions: tools-1.5.2-incubating
> Environment: OpenNLP invoked from C# (.Net 4) via IKVM ikvm-7.0.4335.0
> OpenNLP library dll created using the command: ikvmc -target:library 
> -assembly:opennlp %OPENNLP_LIB_PATH%\opennlp-maxent-3.0.2-incubating.jar 
> %OPENNLP_LIB_PATH%\jwnl-1.3.3.jar 
> %OPENNLP_LIB_PATH%\opennlp-tools-1.5.2-incubating.jar
>Reporter: Ofer Tal
>Priority: Major
>
> After parsing a sentence with opennlp.tools.parser.Parse.parse() some (many) 
> Parse children do not have the correct parent set.
> Details:
> given a Parse node in the tree (let's assume it is in a variable named p)
> When iterating over the Parse[] returned by p.getChildren(), checking 
> p.equals(children[i].getParent()) returns false in many, if not all of the 
> nodes.
> More background --
> to create the parse tree, I used the code:
> {code}
> opennlp.tools.parser.Parse p = new opennlp.tools.parser.Parse(
>     parseSentence,
>     new opennlp.tools.util.Span(0, parseSentence.Length),
>     opennlp.tools.parser.AbstractBottomUpParser.INC_NODE, 1, null);
> // create a Parse object for each token and add it to the parent
> int start = 0;
> foreach (string token in tokenizedSentence)
> {
>     opennlp.tools.parser.Parse tokenParse = new opennlp.tools.parser.Parse(
>         parseSentence,
>         new opennlp.tools.util.Span(start, start + token.Length),
>         opennlp.tools.parser.AbstractBottomUpParser.TOK_NODE,
>         0,
>         0);
>     p.insert(tokenParse);
>     start += token.Length + 1;
> }
> // fetch one possible parse tree
> opennlp.tools.parser.Parse[] parses = parser.parse(p, 1);
> {code}
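
The consistency check described in the report can be sketched independently of OpenNLP. The `TreeNode` class below is a hypothetical stand-in with `getChildren()` and `getParent()` accessors, not the `opennlp.tools.parser.Parse` class, and the sketch compares by reference identity rather than `equals()`, which is an assumption about what "correct parent" should mean. It walks the tree and collects every child whose parent reference does not point back to the node that lists it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parse-tree node; NOT the OpenNLP Parse class. */
final class TreeNode {
  private final String label;
  private TreeNode parent;
  private final List<TreeNode> children = new ArrayList<>();

  TreeNode(String label) {
    this.label = label;
  }

  /** Adds a child and sets its parent reference to this node. */
  void insert(TreeNode child) {
    child.parent = this;
    children.add(child);
  }

  /** Overrides the parent reference; only used here to simulate a broken link. */
  void setParent(TreeNode newParent) {
    this.parent = newParent;
  }

  String getLabel() {
    return label;
  }

  TreeNode getParent() {
    return parent;
  }

  List<TreeNode> getChildren() {
    return children;
  }
}

final class ParentLinkCheck {

  /** Returns the labels of all descendants whose parent is not the node listing them as a child. */
  static List<String> findBrokenLinks(TreeNode root) {
    List<String> broken = new ArrayList<>();
    collect(root, broken);
    return broken;
  }

  private static void collect(TreeNode node, List<String> broken) {
    for (TreeNode child : node.getChildren()) {
      if (child.getParent() != node) {   // reference identity, see the assumption above
        broken.add(child.getLabel());
      }
      collect(child, broken);
    }
  }

  public static void main(String[] args) {
    TreeNode root = new TreeNode("S");
    TreeNode np = new TreeNode("NP");
    TreeNode token = new TreeNode("TK");
    root.insert(np);
    np.insert(token);
    token.setParent(root);               // simulate the inconsistency from the report
    System.out.println("Nodes with a wrong parent: " + findBrokenLinks(root));
  }
}
```

Running a check of this shape over the trees returned by the parser would show whether all nodes or only some of them are affected.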





[jira] [Commented] (OPENNLP-1187) Issue in finding accuracy of model

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645353#comment-17645353
 ] 

ASF GitHub Bot commented on OPENNLP-1187:
-

mawiesne opened a new pull request, #452:
URL: https://github.com/apache/opennlp/pull/452

   Change
   -
   - adds `NaiveBayesEvalParameters` by computation of the outcome total 
values, as proposed by reporter 'agarg98'
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Issue in finding accuracy of model
> --
>
> Key: OPENNLP-1187
> URL: https://issues.apache.org/jira/browse/OPENNLP-1187
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Doccat, Machine Learning
>Affects Versions: 1.8.4
>Reporter: Aman Garg
>Priority: Major
> Fix For: 2.1.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The trainingStats function in the NaiveBayesTrainer class does not work properly 
> and displays wrong results.
> In findParameters(), line 154, i.e. 
> EvalParameters evalParams = new EvalParameters(params, numOutcomes);
> should be replaced by the following block:
>  
> double[] outcomeTotals = new double[outcomeLabels.length];
>     for (int i = 0; i < params.length; ++i) {
>       Context context = params[i];
>       for (int j = 0; j < context.getOutcomes().length; ++j) {
>         int outcome = context.getOutcomes()[j];
>         double count = context.getParameters()[j];
>         outcomeTotals[outcome] += count;
>       }
>     }
> evalParams = new NaiveBayesEvalParameters(params,
> outcomeLabels.length, outcomeTotals, predLabels.length);
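
The accumulation loop in the proposed block can be tried in isolation. The sketch below assumes that each context carries parallel arrays of outcome ids and parameter values, as the `getOutcomes()` / `getParameters()` calls in the proposal suggest. `SimpleContext` and `OutcomeTotalsSketch` are hypothetical stand-ins and not the OpenNLP `Context` or `NaiveBayesEvalParameters` classes, and the numbers are invented.

```java
import java.util.Arrays;

/** Hypothetical, simplified stand-in for a per-predicate context. */
final class SimpleContext {
  final int[] outcomes;       // outcome ids seen together with this predicate
  final double[] parameters;  // one value per entry in outcomes

  SimpleContext(int[] outcomes, double[] parameters) {
    if (outcomes.length != parameters.length) {
      throw new IllegalArgumentException("outcomes and parameters must have the same length");
    }
    this.outcomes = outcomes;
    this.parameters = parameters;
  }
}

final class OutcomeTotalsSketch {

  /** Sums the parameter values per outcome id across all contexts. */
  static double[] outcomeTotals(SimpleContext[] params, int numOutcomes) {
    double[] totals = new double[numOutcomes];
    for (SimpleContext context : params) {
      for (int j = 0; j < context.outcomes.length; j++) {
        totals[context.outcomes[j]] += context.parameters[j];
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    SimpleContext[] params = {
        new SimpleContext(new int[] {0, 1}, new double[] {2.0, 3.0}),
        new SimpleContext(new int[] {1}, new double[] {4.0})
    };
    // Outcome 0 gets 2.0, outcome 1 gets 3.0 + 4.0, outcome 2 never occurs.
    System.out.println(Arrays.toString(outcomeTotals(params, 3))); // prints [2.0, 7.0, 0.0]
  }
}
```

An outcome that never occurs keeps a total of zero, which is worth keeping in mind wherever the totals are later used as a denominator.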





[jira] [Assigned] (OPENNLP-1187) Issue in finding accuracy of model

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1187:
---

Assignee: Martin Wiesner

> Issue in finding accuracy of model
> --
>
> Key: OPENNLP-1187
> URL: https://issues.apache.org/jira/browse/OPENNLP-1187
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Doccat, Machine Learning
>Affects Versions: 1.8.4
>Reporter: Aman Garg
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.1.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The trainingStats function in the NaiveBayesTrainer class does not work properly 
> and displays wrong results.
> In findParameters(), line 154, i.e. 
> EvalParameters evalParams = new EvalParameters(params, numOutcomes);
> should be replaced by the following block:
>  
> double[] outcomeTotals = new double[outcomeLabels.length];
>     for (int i = 0; i < params.length; ++i) {
>       Context context = params[i];
>       for (int j = 0; j < context.getOutcomes().length; ++j) {
>         int outcome = context.getOutcomes()[j];
>         double count = context.getParameters()[j];
>         outcomeTotals[outcome] += count;
>       }
>     }
> evalParams = new NaiveBayesEvalParameters(params,
> outcomeLabels.length, outcomeTotals, predLabels.length);





[jira] [Updated] (OPENNLP-1187) Issue in finding accuracy of model

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1187:

Fix Version/s: 2.1.1

> Issue in finding accuracy of model
> --
>
> Key: OPENNLP-1187
> URL: https://issues.apache.org/jira/browse/OPENNLP-1187
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Doccat, Machine Learning
>Affects Versions: 1.8.4
>Reporter: Aman Garg
>Priority: Major
> Fix For: 2.1.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The trainingStats function in the NaiveBayesTrainer class does not work properly 
> and displays wrong results.
> In findParameters(), line 154, i.e. 
> EvalParameters evalParams = new EvalParameters(params, numOutcomes);
> should be replaced by the following block:
>  
> double[] outcomeTotals = new double[outcomeLabels.length];
>     for (int i = 0; i < params.length; ++i) {
>       Context context = params[i];
>       for (int j = 0; j < context.getOutcomes().length; ++j) {
>         int outcome = context.getOutcomes()[j];
>         double count = context.getParameters()[j];
>         outcomeTotals[outcome] += count;
>       }
>     }
> evalParams = new NaiveBayesEvalParameters(params,
> outcomeLabels.length, outcomeTotals, predLabels.length);





[jira] [Updated] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1182:

Priority: Minor  (was: Major)

> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---
>
> Key: OPENNLP-1182
> URL: https://issues.apache.org/jira/browse/OPENNLP-1182
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.8.4
>Reporter: Steven Rowe
>Priority: Minor
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't 
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora 
> collection at folder leipzig-train/ to the default Language Detector format, 
> by creating groups of 5 sentences as documents and limiting to 1 
> documents per language. Then, it shuffles the result and selects the first 
> 10 lines as train corpus and the last 2 as evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ 
> -sentencesPerSample 5 -samplesPerLanguage 1 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle();' < leipzig.txt > 
> leipzig_shuf.txt
> $ head -10 < leipzig_shuf.txt > leipzig.train
> $ tail -2 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}
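
The shuffle, `head` and `tail` steps of the quoted command sequence can also be sketched in Java, for setups where Perl is not at hand. This is only an illustration of the split itself: `CorpusSplitSketch` and its methods are hypothetical names, the seed parameter is an addition for reproducibility, and nothing here replaces the converter step that the ticket reports as a no-op.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that mirrors "shuffle | head -n" and "shuffle | tail -n". */
final class CorpusSplitSketch {

  /** Returns a shuffled copy; the input list is left untouched. */
  static List<String> shuffled(List<String> lines, long seed) {
    List<String> copy = new ArrayList<>(lines);
    Collections.shuffle(copy, new Random(seed));
    return copy;
  }

  /** First n lines (or all lines if there are fewer than n); n must not be negative. */
  static List<String> head(List<String> lines, int n) {
    return new ArrayList<>(lines.subList(0, Math.min(n, lines.size())));
  }

  /** Last n lines (or all lines if there are fewer than n); n must not be negative. */
  static List<String> tail(List<String> lines, int n) {
    int size = lines.size();
    return new ArrayList<>(lines.subList(Math.max(0, size - n), size));
  }

  public static void main(String[] args) {
    List<String> corpus = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      corpus.add("sample-" + i);         // placeholder lines, not real corpus data
    }
    List<String> shuffledCorpus = shuffled(corpus, 42L);
    List<String> train = head(shuffledCorpus, 6);
    List<String> eval = tail(shuffledCorpus, 2);
    System.out.println("train=" + train.size() + " eval=" + eval.size());
  }
}
```

As with the shell version, the two parts overlap when the requested train and eval sizes together exceed the number of available lines, so the sizes need to be chosen with the corpus size in mind.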





[jira] [Updated] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1182:

Component/s: Language Detector

> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---
>
> Key: OPENNLP-1182
> URL: https://issues.apache.org/jira/browse/OPENNLP-1182
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Language Detector
>Affects Versions: 1.8.4
>Reporter: Steven Rowe
>Priority: Minor
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't 
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora 
> collection at folder leipzig-train/ to the default Language Detector format, 
> by creating groups of 5 sentences as documents and limiting to 1 
> documents per language. Then, it shuffles the result and selects the first 
> 10 lines as train corpus and the last 2 as evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ 
> -sentencesPerSample 5 -samplesPerLanguage 1 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle();' < leipzig.txt > 
> leipzig_shuf.txt
> $ head -10 < leipzig_shuf.txt > leipzig.train
> $ tail -2 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}





[jira] [Updated] (OPENNLP-1220) Add support for Byte Pair Encoding (BPE)

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1220:

Priority: Minor  (was: Major)

> Add support for Byte Pair Encoding (BPE)
> 
>
> Key: OPENNLP-1220
> URL: https://issues.apache.org/jira/browse/OPENNLP-1220
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Jörn Kottmann
>Priority: Minor
>
> It would be nice to add support for BPE to OpenNLP:
> [https://arxiv.org/pdf/1508.07909.pdf]
>  





[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1381:

Description: 
As of OpenJDK 18, the Security Manager has been deprecated (see 
[JEP-411|https://openjdk.org/jeps/411]), which fails all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
    at 
com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
    at 
com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)

 

  was:
As of OpenJDK 18, the Security Manager has been deprecated (see 
[JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 

[jira] [Commented] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release

2022-12-09 Thread Richard Zowalla (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645340#comment-17645340
 ] 

Richard Zowalla commented on OPENNLP-1381:
--

We are testing for exit codes from System.exit(...) here.

A workaround would be to use an approach similar to the one Picocli uses: 
https://github.com/remkop/picocli/issues/1503 , i.e. setting 
-Djava.security.manager=allow before running the tests (programmatically).

An alternative would be to evaluate how we can test for exit codes _or_ migrate 
that kind of test to a (scripted) CI-based integration test on Jenkins or 
GitHub Actions.
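
One possible way to test exit codes without a Security Manager, sketched under the assumption that the command line entry point can be refactored: the code that would call System.exit(...) receives an exit handler instead, and the test passes in a handler that only records the code. `CliSketch`, its tool names and its exit codes are hypothetical and do not reflect the actual opennlp.tools.cmdline.CLI class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntConsumer;

/** Hypothetical command line front end with an injectable exit handler. */
final class CliSketch {

  /** Made-up tool names; a real CLI would register its own tools. */
  private static final Set<String> KNOWN_TOOLS = new HashSet<>(Arrays.asList("echo", "version"));

  private final IntConsumer exitHandler;

  CliSketch(IntConsumer exitHandler) {
    this.exitHandler = exitHandler;
  }

  /** Reports 1 for a missing or unknown tool name and 0 otherwise (illustrative codes). */
  void run(String[] args) {
    if (args.length == 0 || !KNOWN_TOOLS.contains(args[0])) {
      exitHandler.accept(1);
      return;
    }
    exitHandler.accept(0);
  }

  public static void main(String[] args) {
    // A test records the code; production code would pass System::exit instead.
    final int[] recorded = {-1};
    new CliSketch(code -> recorded[0] = code).run(new String[] {"no-such-tool"});
    System.out.println("recorded exit code: " + recorded[0]);
  }
}
```

The trade-off is that the tests then exercise the handler rather than a real JVM shutdown, so a scripted integration test as mentioned above would still be useful to cover the actual process exit code.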



> OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The 
> Security Manager is deprecated and will be removed in a future release
> ---
>
> Key: OPENNLP-1381
> URL: https://issues.apache.org/jira/browse/OPENNLP-1381
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.1.0
> Environment: MacOS  Monterey 12.5 Intel (same issue in M1 chip)
> brew-installed OpenJDK 18.0.2
>Reporter: Bertrand Rigaldies
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: Screen Shot 2022-07-31 at 9.10.48 PM.png, Screen Shot 
> 2022-07-31 at 9.11.09 PM.png
>
>
> As of OpenJDK 18, the Security Manager has been deprecated (see 
> [JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in 
> CLITest.java:
> java.lang.UnsupportedOperationException: The Security Manager is deprecated 
> and will be removed in a future release
>     at java.base/java.lang.System.setSecurityManager(System.java:416)
>     at 
> opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at 
> org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
>     at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>     at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>     at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>     at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
>     at 
> com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
>     at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
>     at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
>     at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)
>  





[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1381:

Description: 
As of OpenJDK 18, the Security Manager has been deprecated (see 
[JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
    at 
com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
    at 
com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)

 

  was:
As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 411] 
([https://openjdk.org/jeps/411])[,|https://openjdk.org/jeps/411)),] which fails 
all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 

[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1381:

Description: 
As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 411] 
([https://openjdk.org/jeps/411])[,|https://openjdk.org/jeps/411)),] which fails 
all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
    at 
com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
    at 
com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
    at 
com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)

 

  was:
As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 
411]([https://openjdk.org/jeps/411)),] which fails all tests in CLITest.java:

java.lang.UnsupportedOperationException: The Security Manager is deprecated and 
will be removed in a future release

    at java.base/java.lang.System.setSecurityManager(System.java:416)
    at 
opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66)
    at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:577)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at 

[jira] [Resolved] (OPENNLP-217) Add Detokenizer Dictionary section

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-217.

Fix Version/s: 1.9.4
   Resolution: Fixed

I checked this for versions 2.1.0, 2.0.0 and 1.9.4; see, for instance: 
[https://opennlp.apache.org/docs/2.1.0/manual/opennlp.html#tools.tokenizer.detokenizing.dict]

Resolving, as the requested documentation has been available since version 1.9.4.

> Add Detokenizer Dictionary section
> --
>
> Key: OPENNLP-217
> URL: https://issues.apache.org/jira/browse/OPENNLP-217
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jörn Kottmann
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.9.4
>
>
> The documentation is lacking a section about the detokenizer dictionary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645270#comment-17645270
 ] 

ASF GitHub Bot commented on OPENNLP-1406:
-

mawiesne commented on code in PR #449:
URL: https://github.com/apache/opennlp/pull/449#discussion_r1044397338


##
opennlp-tools/src/main/java/opennlp/tools/parser/ParserModel.java:
##
@@ -41,16 +41,15 @@
 import opennlp.tools.util.model.POSModelSerializer;
 
 /**
- * This is an abstract base class for {@link ParserModel} implementations.
+ * This is the default {@link ParserModel} implementation.
  */
-// TODO: Model should validate the artifact map

Review Comment:
   No, as this is done in `ParserModel`. That's why I removed the orphaned TODO 
here.





> Enhance JavaDoc in opennlp.tools.parser package
> ---
>
> Key: OPENNLP-1406
> URL: https://issues.apache.org/jira/browse/OPENNLP-1406
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, Parser
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.parser_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1408) Enhance JavaDoc in opennlp.tools.doccat package

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645262#comment-17645262
 ] 

ASF GitHub Bot commented on OPENNLP-1408:
-

mawiesne opened a new pull request, #451:
URL: https://github.com/apache/opennlp/pull/451

   Change
   -
   - adds missing JavaDoc
   - improves existing documentation for clarity
   - removes superfluous text
   - adds 'final' modifier where useful and applicable
   - adds 'Override' annotation where useful and applicable
   - fixes several typos
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Enhance JavaDoc in opennlp.tools.doccat package
> ---
>
> Key: OPENNLP-1408
> URL: https://issues.apache.org/jira/browse/OPENNLP-1408
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Doccat, Documentation
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.doccat_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training

2022-12-09 Thread Jeff Zemerick (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick closed OPENNLP-1218.
--

> All binary implementation of AbstractModelWriter and DataReader  throws 
> java.io.UTFDataFormatException when large dataset is used for training
> --
>
> Key: OPENNLP-1218
> URL: https://issues.apache.org/jira/browse/OPENNLP-1218
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.8.4
>Reporter: Sudheer Prem
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.1.0
>
>
> All binary implementations of AbstractModelWriter and DataReader throw 
> "java.io.UTFDataFormatException: encoded string too long" in the 
> java.io.DataOutputStream.writeUTF method call when a large dataset (more than 
> 64 KB) is used for training. This appears to be a known limitation of the 
> java.io.DataOutputStream.writeUTF method.
> The stack trace is as follows:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>  at 
> opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>  at 
> opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>  at 
> opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>  at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>  at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>  
> The implementation should use a byte array to resolve this issue. 
> The following fix resolves it:
>  
> public void writeUTF(String s) throws java.io.IOException {
>   // Write a 4-byte length prefix and the raw UTF-8 bytes instead of calling
>   // output.writeUTF(s), which is limited to 65535 encoded bytes.
>   byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
>   output.writeInt(ctxByte.length);
>   output.write(ctxByte);
> }
>  
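
The length-prefixed approach proposed in the description can be tried in isolation. The following is only a sketch under stated assumptions: the class LengthPrefixedUtf8 and its methods are hypothetical names made up for illustration and are not taken from OpenNLP or from the actual fix. It shows that DataOutputStream.writeUTF rejects a string of the size reported above, while an int length followed by the raw UTF-8 bytes survives a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a class from OpenNLP.
final class LengthPrefixedUtf8 {

  // Writes a 4-byte length followed by the raw UTF-8 bytes of s.
  static void write(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Reads a string that was written by write(...).
  static String read(DataInputStream in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Encodes and decodes s in memory, to check that nothing is lost.
  static String roundTrip(String s) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        write(out, s);
      }
      try (DataInputStream in =
               new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        return read(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns true if DataOutputStream.writeUTF refuses s as too long.
  static boolean writeUtfRejects(String s) {
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(s);
      return false;
    } catch (UTFDataFormatException e) {
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Note that data written this way cannot be read back with DataInputStream.readUTF, so the reading side has to change together with the writing side.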



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-216) Add Detokenizer API section

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-216.

Fix Version/s: 1.9.4
 Assignee: Martin Wiesner
   Resolution: Fixed

This was handled via PR https://github.com/apache/opennlp/pull/388.

Fix version is 1.9.4 - checked via 
[https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.tokenizer.detokenizing.api]

Resolving as fixed.

> Add Detokenizer API section
> ---
>
> Key: OPENNLP-216
> URL: https://issues.apache.org/jira/browse/OPENNLP-216
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jörn Kottmann
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: help-wanted, pull-request-available
> Fix For: 1.9.4
>
>
> The documentation is lacking a section about the detokenizer API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645242#comment-17645242
 ] 

ASF GitHub Bot commented on OPENNLP-1406:
-

rzo1 commented on code in PR #449:
URL: https://github.com/apache/opennlp/pull/449#discussion_r1044334326


##
opennlp-tools/src/main/java/opennlp/tools/parser/ParserEventTypeEnum.java:
##
@@ -19,13 +19,14 @@
 package opennlp.tools.parser;
 
 /**
- * Enumerated type of event types for the parser.
+ * Enumeration of event types for a {@link Parser}.
  */
 public enum ParserEventTypeEnum {
 
   BUILD,
   CHECK,
 
+  // TODO Add reason why those enum values are deprecated

Review Comment:
   +1



##
opennlp-tools/src/main/java/opennlp/tools/parser/ParserModel.java:
##
@@ -41,16 +41,15 @@
 import opennlp.tools.util.model.POSModelSerializer;
 
 /**
- * This is an abstract base class for {@link ParserModel} implementations.
+ * This is the default {@link ParserModel} implementation.
  */
-// TODO: Model should validate the artifact map

Review Comment:
   Is this still a valid todo?





> Enhance JavaDoc in opennlp.tools.parser package
> ---
>
> Key: OPENNLP-1406
> URL: https://issues.apache.org/jira/browse/OPENNLP-1406
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, Parser
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.parser_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OPENNLP-216) Add Detokenizer API section

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-216:
---
Labels: help-wanted pull-request-available  (was: help-wanted)

> Add Detokenizer API section
> ---
>
> Key: OPENNLP-216
> URL: https://issues.apache.org/jira/browse/OPENNLP-216
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jörn Kottmann
>Priority: Major
>  Labels: help-wanted, pull-request-available
>
> The documentation is lacking a section about the detokenizer API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1233) Penn Treebank tag set link insufficient

2022-12-09 Thread Martin Wiesner (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645239#comment-17645239
 ] 

Martin Wiesner commented on OPENNLP-1233:
-

[~mvsagar] The URL (https://www.clips.uantwerpen.be/pages/mbsp-tags) you 
mentioned in the description is no longer working. Can you provide an 
alternative for this?

> Penn Treebank tag set link insufficient
> ---
>
> Key: OPENNLP-1233
> URL: https://issues.apache.org/jira/browse/OPENNLP-1233
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.9.1
>Reporter: Vidyasagar Mundroy
>Priority: Trivial
>
> The link 
> "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" 
> provided with the name "Penn Treebank tag set" in the Tagging and Chunking 
> sections of the manual does not list the tags used for chunking. A proper link 
> is needed for anybody to develop applications using OpenNLP. I found some info 
> on the missing tags at "https://www.clips.uantwerpen.be/pages/mbsp-tags" but 
> am not sure if the link can be used in the manual.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-1237) Document Categorizer example references non-existant method signature

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1237.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

This was resolved with commit 6c439ae9 by [~alanwang926] in OPENNLP-1319.

The documentation example has been consistent since version 2.0.0.

Resolving as (indirectly) fixed.

> Document Categorizer example references non-existant method signature
> -
>
> Key: OPENNLP-1237
> URL: https://issues.apache.org/jira/browse/OPENNLP-1237
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.9.1
>Reporter: Nick Burch
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.0.0
>
>
> In the Document Categorizer section of the manual 
> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat there 
> is a code snippet in the training section:
> {code}
>   model = DocumentCategorizerME.train("en", sampleStream);
> {code}
> However, no matching method is present in the javadocs at 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html
>  . The nearest seems to be one that takes two additional required parameters: 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html#train-java.lang.String-opennlp.tools.util.ObjectStream-opennlp.tools.util.TrainingParameters-opennlp.tools.doccat.DoccatFactory-
> It looks like the code snippet is out of date, and needs updating to cover 
> the API changes that seem to have happened



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OPENNLP-1237) Document Categorizer example references non-existant method signature

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1237:
---

Assignee: Martin Wiesner

> Document Categorizer example references non-existant method signature
> -
>
> Key: OPENNLP-1237
> URL: https://issues.apache.org/jira/browse/OPENNLP-1237
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.9.1
>Reporter: Nick Burch
>Assignee: Martin Wiesner
>Priority: Major
>
> In the Document Categorizer section of the manual 
> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat there 
> is a code snippet in the training section:
> {code}
>   model = DocumentCategorizerME.train("en", sampleStream);
> {code}
> However, no matching method is present in the javadocs at 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html
>  . The nearest seems to be one that takes two additional required parameters: 
> https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html#train-java.lang.String-opennlp.tools.util.ObjectStream-opennlp.tools.util.TrainingParameters-opennlp.tools.doccat.DoccatFactory-
> It looks like the code snippet is out of date, and needs updating to cover 
> the API changes that seem to have happened



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training

2022-12-09 Thread Martin Wiesner (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627227#comment-17627227
 ] 

Martin Wiesner edited comment on OPENNLP-1218 at 12/9/22 10:46 AM:
---

This is an open duplicate of #OPENNLP-1366 which got resolved recently. The 
fix(es) will be released with OpenNLP version 2.1.0.

[~jzemerick]: Could you mark this one as resolved / duplicate with the above 
fix version?
I applied the fix in #OPENNLP-1366 also for _BinaryNaiveBayesModelWriter_ as 
reported by [~sudheerprem] for his case.

For details, see: [https://github.com/apache/opennlp/pull/427]


was (Author: mawiesne):
This is an open duplicate of #OPENNLP-1366 which got resolved recently. The 
fix(es) will be released with OpenNLP version 2.0.1.

[~jzemerick]: Could you mark this one as resolved / duplicate with the above 
fix version?
I applied the fix in #OPENNLP-1366 also for _BinaryNaiveBayesModelWriter_ as 
reported by [~sudheerprem] for his case.

Details see: [https://github.com/apache/opennlp/pull/427]

> All binary implementation of AbstractModelWriter and DataReader  throws 
> java.io.UTFDataFormatException when large dataset is used for training
> --
>
> Key: OPENNLP-1218
> URL: https://issues.apache.org/jira/browse/OPENNLP-1218
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.8.4
>Reporter: Sudheer Prem
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.1.0
>
>
> All binary implementations of AbstractModelWriter and DataReader throw 
> "java.io.UTFDataFormatException: encoded string too long" in the 
> java.io.DataOutputStream.writeUTF method call when a large dataset (more than 
> 64 KB) is used for training. This appears to be a known limitation of the 
> java.io.DataOutputStream.writeUTF method.
> The stack trace is as follows:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>  at 
> opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>  at 
> opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>  at 
> opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>  at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>  at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>  
> The implementation should use a byte array to resolve this issue. 
> The following fix resolves it:
>  
> public void writeUTF(String s) throws java.io.IOException {
>   // Write a 4-byte length prefix and the raw UTF-8 bytes instead of calling
>   // output.writeUTF(s), which is limited to 65535 encoded bytes.
>   byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
>   output.writeInt(ctxByte.length);
>   output.write(ctxByte);
> }
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1218.
-
Fix Version/s: 2.1.0
 Assignee: Martin Wiesner
   Resolution: Duplicate

> All binary implementation of AbstractModelWriter and DataReader  throws 
> java.io.UTFDataFormatException when large dataset is used for training
> --
>
> Key: OPENNLP-1218
> URL: https://issues.apache.org/jira/browse/OPENNLP-1218
> Project: OpenNLP
>  Issue Type: Bug
>Affects Versions: 1.8.4
>Reporter: Sudheer Prem
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.1.0
>
>
> All binary implementations of AbstractModelWriter and DataReader throw 
> "java.io.UTFDataFormatException: encoded string too long" in the 
> java.io.DataOutputStream.writeUTF method call when a large dataset (more than 
> 64 KB) is used for training. This appears to be a known limitation of the 
> java.io.DataOutputStream.writeUTF method.
> The stack trace is as follows:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>  at 
> opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>  at 
> opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>  at 
> opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>  at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>  at 
> opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>  at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>  
> The implementation should use a byte array to resolve this issue. 
> The following fix resolves it:
>  
> public void writeUTF(String s) throws java.io.IOException {
>   // Write a 4-byte length prefix and the raw UTF-8 bytes instead of calling
>   // output.writeUTF(s), which is limited to 65535 encoded bytes.
>   byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
>   output.writeInt(ctxByte.length);
>   output.write(ctxByte);
> }
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (OPENNLP-1395) Release OpenNLP 2.1.0

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reopened OPENNLP-1395:
-

> Release OpenNLP 2.1.0
> -
>
> Key: OPENNLP-1395
> URL: https://issues.apache.org/jira/browse/OPENNLP-1395
> Project: OpenNLP
>  Issue Type: Task
>  Components: Build, Packaging and Test
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This ticket is to track the work to release OpenNLP 2.1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-1395) Release OpenNLP 2.1.0

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1395.
-
Resolution: Fixed

> Release OpenNLP 2.1.0
> -
>
> Key: OPENNLP-1395
> URL: https://issues.apache.org/jira/browse/OPENNLP-1395
> Project: OpenNLP
>  Issue Type: Task
>  Components: Build, Packaging and Test
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This ticket is to track the work to release OpenNLP 2.1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (OPENNLP-1395) Release OpenNLP 2.1.0

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner closed OPENNLP-1395.
---

OpenNLP version 2.1.0 has been out for some days now, see: 
https://opennlp.apache.org/download.html

Closing issue as resolved.

> Release OpenNLP 2.1.0
> -
>
> Key: OPENNLP-1395
> URL: https://issues.apache.org/jira/browse/OPENNLP-1395
> Project: OpenNLP
>  Issue Type: Task
>  Components: Build, Packaging and Test
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This ticket is to track the work to release OpenNLP 2.1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OPENNLP-1395) Release OpenNLP 2.1.0

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner resolved OPENNLP-1395.
-
Fix Version/s: 2.1.0
   Resolution: Done

> Release OpenNLP 2.1.0
> -
>
> Key: OPENNLP-1395
> URL: https://issues.apache.org/jira/browse/OPENNLP-1395
> Project: OpenNLP
>  Issue Type: Task
>  Components: Build, Packaging and Test
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This ticket is to track the work to release OpenNLP 2.1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1307:

Fix Version/s: 2.1.1

> Incorrect code example for Document Categorization (9.3)
> 
>
> Key: OPENNLP-1307
> URL: https://issues.apache.org/jira/browse/OPENNLP-1307
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Doccat
>Affects Versions: 1.9.3
> Environment: N/A
>Reporter: John Slocum
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: DocumentCategorizerME, documentation
> Fix For: 2.1.1
>
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> In 
> https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api,
> the code example feeds a String into DocumentCategorizerME.categorize(). The 
> method itself takes an array. I flagged priority as Major because this was a 
> killer - obviously it's a self-documenting bug when you run it, but I made 
> the mistake of assuming that the array actually needed would be an array of 
> documents - instead it needs to be an array of tokens from a single document, 
> i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting 
> with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) 
> before figuring this one out.
>  
> Current (wrong) version:
>  
> {code:java}
> String inputText = ...
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText);
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  
> Should be more like:
>  
> {code:java}
> String inputText = ... // sanitized document to be categorized
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText.split(" "));
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  
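
As the description points out, categorize() expects the tokens of a single document rather than an array of documents. The token array can be prepared with plain Java, as in the minimal sketch below. The class name WhitespaceTokens is hypothetical and not part of OpenNLP, and the categorizer call itself is left out because it needs a trained model. The sketch splits on runs of whitespace, since split(" ") yields empty tokens wherever two spaces follow each other.

```java
// Hypothetical helper for illustration only; not a class from OpenNLP.
final class WhitespaceTokens {

  // Splits one document into tokens on runs of whitespace.
  // Leading and trailing whitespace is dropped first so that
  // no empty token appears at the start of the result.
  static String[] tokenize(String document) {
    String trimmed = document.trim();
    if (trimmed.isEmpty()) {
      return new String[0];
    }
    return trimmed.split("\\s+");
  }
}
```

The resulting array would then be passed to categorize() in place of the raw String. OpenNLP also ships its own tokenizers in the opennlp.tools.tokenize package, which may be the better choice when the model was trained on text tokenized that way.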



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1307:
---

Assignee: Martin Wiesner

> Incorrect code example for Document Categorization (9.3)
> 
>
> Key: OPENNLP-1307
> URL: https://issues.apache.org/jira/browse/OPENNLP-1307
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Doccat
>Affects Versions: 1.9.3
> Environment: N/A
>Reporter: John Slocum
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: DocumentCategorizerME, documentation
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> In 
> https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api,
> the code example feeds a String into DocumentCategorizerME.categorize(). The 
> method itself takes an array. I flagged priority as Major because this was a 
> killer - obviously it's a self-documenting bug when you run it, but I made 
> the mistake of assuming that the array actually needed would be an array of 
> documents - instead it needs to be an array of tokens from a single document, 
> i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting 
> with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) 
> before figuring this one out.
>  
> Current (wrong) version:
>  
> {code:java}
> String inputText = ...
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText);
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  
> Should be more like:
>  
> {code:java}
> String inputText = ... // sanitized document to be categorized
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText.split(" "));
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OPENNLP-1405) Enhance JavaDoc in opennlp.tools.tokenize package

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1405:
---

Assignee: Martin Wiesner

> Enhance JavaDoc in opennlp.tools.tokenize package
> -
>
> Key: OPENNLP-1405
> URL: https://issues.apache.org/jira/browse/OPENNLP-1405
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, Tokenizer
>Affects Versions: 2.0.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.tokenize_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1406:
---

Assignee: Martin Wiesner

> Enhance JavaDoc in opennlp.tools.parser package
> ---
>
> Key: OPENNLP-1406
> URL: https://issues.apache.org/jira/browse/OPENNLP-1406
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, Parser
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.parser_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.





[jira] [Created] (OPENNLP-1408) Enhance JavaDoc in opennlp.tools.doccat package

2022-12-09 Thread Martin Wiesner (Jira)
Martin Wiesner created OPENNLP-1408:
---

 Summary: Enhance JavaDoc in opennlp.tools.doccat package
 Key: OPENNLP-1408
 URL: https://issues.apache.org/jira/browse/OPENNLP-1408
 Project: OpenNLP
  Issue Type: Improvement
  Components: Doccat, Documentation
Affects Versions: 2.1.0
Reporter: Martin Wiesner
Assignee: Martin Wiesner
 Fix For: 2.1.1


The JavaDoc of the _opennlp.tools.doccat_ package suffers from several 
inconsistencies and missing descriptions. Moreover, several typos are present 
that need sanitizing.

It needs enhancements and/or additions to provide more clarity for readers.





[jira] [Assigned] (OPENNLP-1404) Enhance JavaDoc in opennlp.tools.postag package

2022-12-09 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner reassigned OPENNLP-1404:
---

Assignee: Martin Wiesner

> Enhance JavaDoc in opennlp.tools.postag package
> ---
>
> Key: OPENNLP-1404
> URL: https://issues.apache.org/jira/browse/OPENNLP-1404
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, POS Tagger
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.postag_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers of 
> this part of the OpenNLP API.





[jira] [Commented] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645226#comment-17645226
 ] 

ASF GitHub Bot commented on OPENNLP-1307:
-

mawiesne opened a new pull request, #450:
URL: https://github.com/apache/opennlp/pull/450

   Change
   -
   - fixes `doccat.xml` as proposed in OPENNLP-1307; the current example in the 
documentation was inconsistent with `DocumentCategorizerME#categorize(..)` and 
required a change
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Incorrect code example for Document Categorization (9.3)
> 
>
> Key: OPENNLP-1307
> URL: https://issues.apache.org/jira/browse/OPENNLP-1307
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Doccat
>Affects Versions: 1.9.3
> Environment: N/A
>Reporter: John Slocum
>Priority: Major
>  Labels: DocumentCategorizerME, documentation
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> In 
> [https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api],
> the code example feeds a String into DocumentCategorizerME.categorize(). The 
> method itself takes an array. I flagged priority as Major because this was a 
> killer - obviously it's a self-documenting bug when you run it, but I made 
> the mistake of assuming that the array actually needed would be an array of 
> documents - instead it needs to be an array of tokens from a single document, 
> i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting 
> with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) 
> before figuring this one out.
>  
> Current (wrong) version:
>  
> {code:java}
> String inputText = ...
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText);
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  
> Should be more like:
>  
> {code:java}
> String inputText = ... // sanitized document to be categorized
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText.split(" "));
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>  
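
The point of the report above is that `DocumentCategorizerME#categorize` expects the tokens of a single document, not the raw string and not an array of documents. The following is a minimal sketch of that tokenization step using only the JDK. The class name `DoccatTokenSketch` and the helper `toTokens` are hypothetical and not part of OpenNLP. It also uses a whitespace regex instead of the `split(" ")` proposed in the issue, on the assumption that real input may contain tabs or repeated spaces.

```java
import java.util.Arrays;

public class DoccatTokenSketch {

    // Hypothetical helper (not an OpenNLP API): turns ONE document into its
    // whitespace-separated tokens, ignoring leading, trailing and repeated whitespace.
    public static String[] toTokens(String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String inputText = "  OpenNLP  categorizes\ttokenized documents ";

        // Splitting on a single space keeps empty strings between repeated
        // spaces and does not separate tokens joined by a tab.
        String[] naive = inputText.split(" ");

        // Splitting on runs of whitespace yields only the actual tokens.
        String[] tokens = toTokens(inputText);

        System.out.println(Arrays.toString(naive));
        System.out.println(Arrays.toString(tokens));

        // With a trained model available, 'tokens' (not 'inputText') is the
        // kind of array that would be passed to categorize(...).
    }
}
```

Whether a plain whitespace split matches the tokenization used when the model was trained is a separate question; the sketch only illustrates the shape of the argument.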





[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package

2022-12-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645203#comment-17645203
 ] 

ASF GitHub Bot commented on OPENNLP-1406:
-

mawiesne opened a new pull request, #449:
URL: https://github.com/apache/opennlp/pull/449

   Change
   -
   - adds missing JavaDoc
   - improves existing documentation for clarity
   - removes superfluous text
   - removes orphaned (commented) code fragments in 
`..parser.treeinsert.ParserEventStream`
   - fixes a missing variable assignment in `ParserCrossValidator` (a hidden 
bug)
   - adds 'final' modifier where useful and applicable
   - adds 'Override' annotation where useful and applicable
   - fixes several typos
   - corrects some inconsistencies in the `opennlp.tools.chunker` and 
`opennlp.tools.langdetect` packages
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Enhance JavaDoc in opennlp.tools.parser package
> ---
>
> Key: OPENNLP-1406
> URL: https://issues.apache.org/jira/browse/OPENNLP-1406
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation, Parser
>Affects Versions: 2.1.0
>Reporter: Martin Wiesner
>Priority: Minor
> Fix For: 2.1.1
>
>
> The JavaDoc of the _opennlp.tools.parser_ package suffers from several 
> inconsistencies and missing descriptions. Moreover, several typos are present 
> that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.


