[jira] [Updated] (OPENNLP-1314) ConlluWordLine to print line contents when throwing format error
[ https://issues.apache.org/jira/browse/OPENNLP-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1314: Fix Version/s: 2.1.0 > ConlluWordLine to print line contents when throwing format error > > > Key: OPENNLP-1314 > URL: https://issues.apache.org/jira/browse/OPENNLP-1314 > Project: OpenNLP > Issue Type: Improvement > Components: Formats >Affects Versions: 1.9.3 >Reporter: Markus Jelsma >Assignee: Martin Wiesner >Priority: Trivial > Fix For: 2.1.0 > > Attachments: OPENNLP-1314.patch > > > Exception thrown for edit/formatting errors is not helpful in debugging. This > tiny patch makes my day much easier. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1314) ConlluWordLine to print line contents when throwing format error
[ https://issues.apache.org/jira/browse/OPENNLP-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1314. - Assignee: Martin Wiesner Resolution: Fixed This was resolved by commit 8069a734 in the context of OPENNLP-1372. Closing as duplicate. > ConlluWordLine to print line contents when throwing format error > > > Key: OPENNLP-1314 > URL: https://issues.apache.org/jira/browse/OPENNLP-1314 > Project: OpenNLP > Issue Type: Improvement > Components: Formats >Affects Versions: 1.9.3 >Reporter: Markus Jelsma >Assignee: Martin Wiesner >Priority: Trivial > Attachments: OPENNLP-1314.patch > > > Exception thrown for edit/formatting errors is not helpful in debugging. This > tiny patch makes my day much easier. -- This message was sent by Atlassian Jira (v8.20.10#820010)
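The improvement requested in OPENNLP-1314 can be illustrated with a minimal sketch. This is not the attached patch and not OpenNLP's `ConlluWordLine`; the class `ConlluLineSketch`, the method `parseFields`, and the use of `IllegalArgumentException` are hypothetical names chosen for illustration. The only outside fact assumed is that a CoNLL-U word line has ten tab-separated fields. The point is that the exception message carries the offending line, so a formatting error in a large corpus file can be located quickly.

```java
// Hypothetical stand-in for illustration; not OpenNLP's ConlluWordLine.
class ConlluLineSketch {

    // A CoNLL-U word line has ten tab-separated fields.
    static final int FIELD_COUNT = 10;

    // Splits a line into its fields and, on a format error,
    // reports the raw line contents in the exception message.
    static String[] parseFields(String line) {
        // Limit -1 keeps trailing empty fields so the count is exact.
        String[] fields = line.split("\t", -1);
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_COUNT + " tab-separated fields but found "
                    + fields.length + " in line: '" + line + "'");
        }
        return fields;
    }

    public static void main(String[] args) {
        try {
            parseFields("1\tHello\thello"); // too few fields
        } catch (IllegalArgumentException e) {
            // The message now shows which line was malformed.
            System.out.println(e.getMessage());
        }
    }
}
```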
[jira] [Updated] (OPENNLP-731) SAMpLes
[ https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-731: --- Issue Type: Wish (was: Improvement) > SAMpLes > --- > > Key: OPENNLP-731 > URL: https://issues.apache.org/jira/browse/OPENNLP-731 > Project: OpenNLP > Issue Type: Wish > Environment: Java on Linux OS >Reporter: Giuseppe Laurenza >Priority: Minor > Labels: features > > Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that > provides algorithms for sentiment analysis, scoring phrases on a scale from > 1.0 to 5.0. The distinguishing feature of this engine is that it uses "Online-Reviews > with a vote" to build the dictionary. It comes with two complete dictionaries > for English and Italian and with functions to generate a > custom dictionary from a personal dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-731) Integrate SAMpLes engine
[ https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-731: --- Summary: Integrate SAMpLes engine (was: Integrate SAMpLes) > Integrate SAMpLes engine > > > Key: OPENNLP-731 > URL: https://issues.apache.org/jira/browse/OPENNLP-731 > Project: OpenNLP > Issue Type: Wish > Environment: Java on Linux OS >Reporter: Giuseppe Laurenza >Priority: Minor > Labels: features > > Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that > provides algorithms for sentiment analysis, scoring phrases on a scale from > 1.0 to 5.0. The distinguishing feature of this engine is that it uses "Online-Reviews > with a vote" to build the dictionary. It comes with two complete dictionaries > for English and Italian and with functions to generate a > custom dictionary from a personal dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-731) Integrate SAMpLes
[ https://issues.apache.org/jira/browse/OPENNLP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-731: --- Summary: Integrate SAMpLes (was: SAMpLes) > Integrate SAMpLes > - > > Key: OPENNLP-731 > URL: https://issues.apache.org/jira/browse/OPENNLP-731 > Project: OpenNLP > Issue Type: Wish > Environment: Java on Linux OS >Reporter: Giuseppe Laurenza >Priority: Minor > Labels: features > > Sentiment Analysis with MultiPle LanguagES [SAMpLes] is an engine that > provides algorithms for sentiment analysis, scoring phrases on a scale from > 1.0 to 5.0. The distinguishing feature of this engine is that it uses "Online-Reviews > with a vote" to build the dictionary. It comes with two complete dictionaries > for English and Italian and with functions to generate a > custom dictionary from a personal dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-504) Add a FAQ page to our site
[ https://issues.apache.org/jira/browse/OPENNLP-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-504. Resolution: Fixed The FAQ is available here: [https://opennlp.apache.org/faq.html] and has the content proposed/discussed. > Add a FAQ page to our site > -- > > Key: OPENNLP-504 > URL: https://issues.apache.org/jira/browse/OPENNLP-504 > Project: OpenNLP > Issue Type: Improvement > Components: Website >Reporter: James Kosin >Assignee: Bruno P. Kinoshita >Priority: Minor > Labels: FAQ, newbie > Attachments: opennlp-faq-wip-20170512-fullpage.png > > > Collect and assemble a FAQ page for our site. > Most questions start out: > Where can I get the models? > Where do I start getting to know OpenNLP? > etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1345) The Training API code for Sentence Detection is outdated in manual
[ https://issues.apache.org/jira/browse/OPENNLP-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1345. - Fix Version/s: 2.0.0 Resolution: Fixed This is fixed with commit 6c4dc364 in the context of OPENNLP-1362. Since this change, the documentation is consistent with the API again. > The Training API code for Sentence Detection is outdated in manual > -- > > Key: OPENNLP-1345 > URL: https://issues.apache.org/jira/browse/OPENNLP-1345 > Project: OpenNLP > Issue Type: Bug > Components: Documentation >Affects Versions: 1.9.4 >Reporter: Phillip Rhodes >Priority: Minor > Labels: documentation, easy-fix > Fix For: 2.0.0 > > > The Training API example code at > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section on > Sentence Detection training is incorrect. The current code sample is: > {code:java} > ObjectStream lineStream = > new PlainTextByLineStream(new FileInputStream("en-sent.train"), > StandardCharsets.UTF_8); > {code} > But PlainTextByLineStream no longer takes an InputStream as the first > argument to its constructor. It now requires an InputStreamFactory. > NOTE: this same pattern reappears in multiple places in the current manual. > See also, OPENNLP-1319 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1346) The Training API code for Tokenization is outdated in manual (1/2)
[ https://issues.apache.org/jira/browse/OPENNLP-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1346. - Fix Version/s: 2.0.0 Assignee: Martin Wiesner Resolution: Fixed This is fixed with commit 6c4dc364 in the context of OPENNLP-1362. Since this change, the documentation is consistent with the API again. > The Training API code for Tokenization is outdated in manual (1/2) > -- > > Key: OPENNLP-1346 > URL: https://issues.apache.org/jira/browse/OPENNLP-1346 > Project: OpenNLP > Issue Type: Bug > Components: Documentation >Affects Versions: 1.9.4 >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easy-fix > Fix For: 2.0.0 > > > The Training API example code at > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section > dealing with Tokenizer training is incorrect. The current code sample is: > {code:java} > ObjectStream lineStream = new PlainTextByLineStream(new > FileInputStream("en-sent.train"), > StandardCharsets.UTF_8);{code} > But PlainTextByLineStream no longer takes an InputStream as the first > argument to its constructor. It now requires an InputStreamFactory. > NOTE: this same pattern reappears in multiple places in the current manual. > See also OPENNLP-1319 and OPENNLP-1345 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1349) The Training API code for Document Categorization is outdated in manual
[ https://issues.apache.org/jira/browse/OPENNLP-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1349. - Fix Version/s: 2.0.0 Assignee: Martin Wiesner Resolution: Fixed This is fixed with commit 6c4dc364 in the context of OPENNLP-1362. Since the change, the documentation is consistent with the API again. > The Training API code for Document Categorization is outdated in manual > --- > > Key: OPENNLP-1349 > URL: https://issues.apache.org/jira/browse/OPENNLP-1349 > Project: OpenNLP > Issue Type: Bug > Components: Documentation >Affects Versions: 1.9.4 >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easy-fix > Fix For: 2.0.0 > > > The Training API example code at > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section > dealing with TokenNameFinder training is incorrect. The current code sample > includes: > {code:java} > try (dataIn = new FileInputStream("en-sentiment.train")) { > ObjectStream lineStream = > new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8); > }{code} > But PlainTextByLineStream no longer takes an InputStream as the first > argument to its constructor. It now requires an InputStreamFactory. > NOTE: this same pattern reappears in multiple places in the current manual. > See also OPENNLP-1319, OPENNLP-1345, and OPENNLP-1346 among others. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1348) The Training API code for NamedEntityRecognition is outdated in manual
[ https://issues.apache.org/jira/browse/OPENNLP-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1348. - Fix Version/s: 2.0.0 Assignee: Martin Wiesner Resolution: Fixed This is fixed with commit 6c4dc364 in the context of OPENNLP-1362. Since the change, the documentation is consistent with the API again. > The Training API code for NamedEntityRecognition is outdated in manual > -- > > Key: OPENNLP-1348 > URL: https://issues.apache.org/jira/browse/OPENNLP-1348 > Project: OpenNLP > Issue Type: Bug > Components: Documentation >Affects Versions: 1.9.4 >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easy-fix > Fix For: 2.0.0 > > > The Training API example code at > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html] in the section > dealing with TokenNameFinder training is incorrect. The current code sample > includes: > {code:java} > ObjectStream lineStream = new PlainTextByLineStream(new > FileInputStream("en-sent.train"), > StandardCharsets.UTF_8);{code} > But PlainTextByLineStream no longer takes an InputStream as the first > argument to its constructor. It now requires an InputStreamFactory. > NOTE: this same pattern reappears in multiple places in the current manual. > See also OPENNLP-1319, OPENNLP-1345, and OPENNLP-1346 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
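The tickets above all note that `PlainTextByLineStream` now expects an `InputStreamFactory` rather than an `InputStream`. The sketch below is a stdlib-only illustration of why a factory can be preferable; it does not use OpenNLP, and the nested interface `StreamFactory` and the method `readLines` are hypothetical stand-ins. The assumption illustrated is that a factory can hand out a fresh stream each time, so the same data can be read more than once, which a single already-consumed `InputStream` cannot offer. With OpenNLP itself, a factory such as `MarkableFileInputStreamFactory` would typically be passed; check the API of the version in use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StreamFactorySketch {

    // Hypothetical stand-in for the role an input stream factory plays;
    // this is NOT the OpenNLP interface itself.
    interface StreamFactory {
        InputStream createInputStream() throws IOException;
    }

    // Opens a fresh stream from the factory and reads it line by line.
    static List<String> readLines(StreamFactory factory) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(factory.createInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "First sentence.\nSecond sentence.\n".getBytes(StandardCharsets.UTF_8);
        StreamFactory factory = () -> new ByteArrayInputStream(data);

        // Each call obtains a new stream, so a second pass over the data works.
        System.out.println(readLines(factory));
        System.out.println(readLines(factory));
    }
}
```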
[jira] [Resolved] (OPENNLP-1347) The Training API code for Tokenization is outdated in manual (2/2)
[ https://issues.apache.org/jira/browse/OPENNLP-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1347. - Fix Version/s: 2.0.0 Assignee: Martin Wiesner Resolution: Fixed This is fixed with commit 6c439ae9 in the context of OPENNLP-1319. Since the change, the documentation is consistent with the API again. > The Training API code for Tokenization is outdated in manual (2/2) > -- > > Key: OPENNLP-1347 > URL: https://issues.apache.org/jira/browse/OPENNLP-1347 > Project: OpenNLP > Issue Type: Bug > Components: Documentation >Affects Versions: 1.9.4 >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easy-fix > Fix For: 2.0.0 > > > The code sample in the manual at <> in the section on Tokenizer training > is incorrect. The current code sample is: > > {code:java} > try { > model = TokenizerME.train("en", sampleStream, true, > TrainingParameters.defaultParams()); > } {code} > But TokenizerME.train() now has a new signature which requires a > TokenizerFactory. The above does not compile with the 1.9.4 library version. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1357) Use CharSequence to allow for memory management
[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1357: --- Assignee: Martin Wiesner > Use CharSequence to allow for memory management > --- > > Key: OPENNLP-1357 > URL: https://issues.apache.org/jira/browse/OPENNLP-1357 > Project: OpenNLP > Issue Type: Improvement > Components: Sentence Detector >Affects Versions: 1.9.4 >Reporter: Paul Austin >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > Most of the classes in OpenNLP require the inputs to be as String, > StringBuffer, or char[]. This means that you have to load all the data into > memory. > Many of these cases (String and StringBuffer args) could be replaced with a > single method that accepts CharSequence as a parameter. > For example DefaultEndOfSentenceScanner > > {code:java} > public List getPositions(CharSequence s) { > List l = new ArrayList<>(); > for (int i = 0; i < s.length(); i++) { > char c = s.charAt(i); > if (eosCharacters.contains(c)) { > l.add(i); > } > } > return l; > } > {code} > This would allow for users to manage the memory overhead for large data sets. > And in some cases require less temporary memory conversion to char buffers. > Some code such as the SDContextGenerator already uses CharSequence. However > in SentenceDetectorME there is an unnecessary conversion to a StringBuffer. > The sb isn't modified and the SDContextGenerator.getContext takes > CharSequence as an arg and String is a CharSequence. > > {code:java} > public Span[] sentPosDetect(String s) { > sentProbs.clear(); > StringBuffer sb = new StringBuffer(s);{code} > > I can create a pull request(s) for the above if you think it is useful. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
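The proposal in OPENNLP-1357 can be tried out with a self-contained sketch. The class `EosScannerSketch` and the set `EOS_CHARACTERS` are hypothetical and only mirror the shape of the method quoted in the ticket; this is not the OpenNLP implementation. It shows that a single `CharSequence` parameter accepts a `String`, a `StringBuilder`, and a `CharBuffer` without copying the text into a new `String` first.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class EosScannerSketch {

    // Assumed end-of-sentence characters, chosen for illustration only.
    static final Set<Character> EOS_CHARACTERS = Set.of('.', '!', '?');

    // Returns the indices of all end-of-sentence characters in s.
    // Accepting CharSequence lets callers pass any character container.
    static List<Integer> getPositions(CharSequence s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (EOS_CHARACTERS.contains(s.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Hi there. How are you? Fine!";

        // The same method serves three different CharSequence implementations.
        System.out.println(getPositions(text));
        System.out.println(getPositions(new StringBuilder(text)));
        System.out.println(getPositions(CharBuffer.wrap(text.toCharArray())));
    }
}
```

A caller working with very large inputs could supply its own `CharSequence` implementation backed by a memory-mapped file or a sliding buffer, which is the memory-management benefit the reporter describes.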
[jira] [Commented] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual
[ https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645383#comment-17645383 ] ASF GitHub Bot commented on OPENNLP-1344: - mawiesne opened a new pull request, #454: URL: https://github.com/apache/opennlp/pull/454 Change - - adjusts URL to Leipzig corpora in corresponding `langdetect.xml`, as proposed by reporter 'P. Rhodes' - verified the new URL works as expected Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. 
> Broken link (404) for Leipzig corpora in OpenNLP Manual > --- > > Key: OPENNLP-1344 > URL: https://issues.apache.org/jira/browse/OPENNLP-1344 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easyfix > Fix For: 2.1.1 > > > In the User Manual at: > https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html > The download link for the Leipzig Corpora is listed as: > https://corpora.uni-leipzig.de/download.html > however this link returns a 404 Not Found error. The correct link now appears > to be: > [https://wortschatz.uni-leipzig.de/en/download] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual
[ https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1344: Fix Version/s: 2.1.1 > Broken link (404) for Leipzig corpora in OpenNLP Manual > --- > > Key: OPENNLP-1344 > URL: https://issues.apache.org/jira/browse/OPENNLP-1344 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Reporter: Phillip Rhodes >Priority: Minor > Labels: documentation, easyfix > Fix For: 2.1.1 > > > In the User Manual at: > https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html > The download link for the Leipzig Corpora is listed as: > https://corpora.uni-leipzig.de/download.html > however this link returns a 404 Not Found error. The correct link now appears > to be: > [https://wortschatz.uni-leipzig.de/en/download] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual
[ https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1344: --- Assignee: Martin Wiesner > Broken link (404) for Leipzig corpora in OpenNLP Manual > --- > > Key: OPENNLP-1344 > URL: https://issues.apache.org/jira/browse/OPENNLP-1344 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Reporter: Phillip Rhodes >Assignee: Martin Wiesner >Priority: Minor > Labels: documentation, easyfix > Fix For: 2.1.1 > > > In the User Manual at: > https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html > The download link for the Leipzig Corpora is listed as: > https://corpora.uni-leipzig.de/download.html > however this link returns a 404 Not Found error. The correct link now appears > to be: > [https://wortschatz.uni-leipzig.de/en/download] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1344) Broken link (404) for Leipzig corpora in OpenNLP Manual
[ https://issues.apache.org/jira/browse/OPENNLP-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1344: Issue Type: Documentation (was: Bug) > Broken link (404) for Leipzig corpora in OpenNLP Manual > --- > > Key: OPENNLP-1344 > URL: https://issues.apache.org/jira/browse/OPENNLP-1344 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Reporter: Phillip Rhodes >Priority: Minor > Labels: documentation, easyfix > > In the User Manual at: > https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html > The download link for the Leipzig Corpora is listed as: > https://corpora.uni-leipzig.de/download.html > however this link returns a 404 Not Found error. The correct link now appears > to be: > [https://wortschatz.uni-leipzig.de/en/download] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1357) Use CharSequence to allow for memory management
[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1357: Issue Type: Improvement (was: New Feature) > Use CharSequence to allow for memory management > --- > > Key: OPENNLP-1357 > URL: https://issues.apache.org/jira/browse/OPENNLP-1357 > Project: OpenNLP > Issue Type: Improvement > Components: Sentence Detector >Affects Versions: 1.9.4 >Reporter: Paul Austin >Priority: Minor > Fix For: 2.1.1 > > > Most of the classes in OpenNLP require the inputs to be as String, > StringBuffer, or char[]. This means that you have to load all the data into > memory. > Many of these cases (String and StringBuffer args) could be replaced with a > single method that accepts CharSequence as a parameter. > For example DefaultEndOfSentenceScanner > > {code:java} > public List getPositions(CharSequence s) { > List l = new ArrayList<>(); > for (int i = 0; i < s.length(); i++) { > char c = s.charAt(i); > if (eosCharacters.contains(c)) { > l.add(i); > } > } > return l; > } > {code} > This would allow for users to manage the memory overhead for large data sets. > And in some cases require less temporary memory conversion to char buffers. > Some code such as the SDContextGenerator already uses CharSequence. However > in SentenceDetectorME there is an unnecessary conversion to a StringBuffer. > The sb isn't modified and the SDContextGenerator.getContext takes > CharSequence as an arg and String is a CharSequence. > > {code:java} > public Span[] sentPosDetect(String s) { > sentProbs.clear(); > StringBuffer sb = new StringBuffer(s);{code} > > I can create a pull request(s) for the above if you think it is useful. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1357) Use CharSequence to allow for memory management
[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1357: Fix Version/s: 2.1.1 > Use CharSequence to allow for memory management > --- > > Key: OPENNLP-1357 > URL: https://issues.apache.org/jira/browse/OPENNLP-1357 > Project: OpenNLP > Issue Type: New Feature > Components: Sentence Detector >Affects Versions: 1.9.4 >Reporter: Paul Austin >Priority: Minor > Fix For: 2.1.1 > > > Most of the classes in OpenNLP require the inputs to be as String, > StringBuffer, or char[]. This means that you have to load all the data into > memory. > Many of these cases (String and StringBuffer args) could be replaced with a > single method that accepts CharSequence as a parameter. > For example DefaultEndOfSentenceScanner > > {code:java} > public List getPositions(CharSequence s) { > List l = new ArrayList<>(); > for (int i = 0; i < s.length(); i++) { > char c = s.charAt(i); > if (eosCharacters.contains(c)) { > l.add(i); > } > } > return l; > } > {code} > This would allow for users to manage the memory overhead for large data sets. > And in some cases require less temporary memory conversion to char buffers. > Some code such as the SDContextGenerator already uses CharSequence. However > in SentenceDetectorME there is an unnecessary conversion to a StringBuffer. > The sb isn't modified and the SDContextGenerator.getContext takes > CharSequence as an arg and String is a CharSequence. > > {code:java} > public Span[] sentPosDetect(String s) { > sentProbs.clear(); > StringBuffer sb = new StringBuffer(s);{code} > > I can create a pull request(s) for the above if you think it is useful. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1357) Use CharSequence to allow for memory management
[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645376#comment-17645376 ] ASF GitHub Bot commented on OPENNLP-1357: - mawiesne opened a new pull request, #453: URL: https://github.com/apache/opennlp/pull/453 Change - - adjusts method signatures in `SentenceDetector` and `EndOfSentenceScanner` to use `CharSequence` as proposed by reporter 'P. Austin' - adapts existing impl classes to work (fine) with this change, see comments in OPENNLP-1357 - adjusts JavaDoc accordingly - adds 'Override' annotations in some spots where they were missing Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? 
### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. > Use CharSequence to allow for memory management > --- > > Key: OPENNLP-1357 > URL: https://issues.apache.org/jira/browse/OPENNLP-1357 > Project: OpenNLP > Issue Type: New Feature > Components: Sentence Detector >Affects Versions: 1.9.4 >Reporter: Paul Austin >Priority: Minor > > Most of the classes in OpenNLP require the inputs to be as String, > StringBuffer, or char[]. This means that you have to load all the data into > memory. > Many of these cases (String and StringBuffer args) could be replaced with a > single method that accepts CharSequence as a parameter. > For example DefaultEndOfSentenceScanner > > {code:java} > public List getPositions(CharSequence s) { > List l = new ArrayList<>(); > for (int i = 0; i < s.length(); i++) { > char c = s.charAt(i); > if (eosCharacters.contains(c)) { > l.add(i); > } > } > return l; > } > {code} > This would allow for users to manage the memory overhead for large data sets. > And in some cases require less temporary memory conversion to char buffers. > Some code such as the SDContextGenerator already uses CharSequence. However > in SentenceDetectorME there is an unnecessary conversion to a StringBuffer. > The sb isn't modified and the SDContextGenerator.getContext takes > CharSequence as an arg and String is a CharSequence. > > {code:java} > public Span[] sentPosDetect(String s) { > sentProbs.clear(); > StringBuffer sb = new StringBuffer(s);{code} > > I can create a pull request(s) for the above if you think it is useful. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1363) Verify the documentation of the lemmatizer input format
[ https://issues.apache.org/jira/browse/OPENNLP-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1363: Issue Type: Documentation (was: Task) > Verify the documentation of the lemmatizer input format > --- > > Key: OPENNLP-1363 > URL: https://issues.apache.org/jira/browse/OPENNLP-1363 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Reporter: Jeff Zemerick >Priority: Minor > > In OPENNLP-1257, a change was proposed to update the code to split the > lemmatizer input by spaces instead of by tab. I believe tab is the desired > delimiter but we need to make sure the documentation is consistent. > Refer to > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer|https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer.] > , in particular the following sentences: > "The training data consist of three columns separated by spaces. Each word > has been put on a separate line and there is an empty line after each > sentence. The first column contains the current word, the second its > part-of-speech tag and the third its lemma. Here is an example of the file > format:" > Determine if that first line should read "separated by tabs" instead. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1363) Verify the documentation of the lemmatizer input format
[ https://issues.apache.org/jira/browse/OPENNLP-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1363: Affects Version/s: 2.1.0 > Verify the documentation of the lemmatizer input format > --- > > Key: OPENNLP-1363 > URL: https://issues.apache.org/jira/browse/OPENNLP-1363 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Jeff Zemerick >Priority: Minor > > In OPENNLP-1257, a change was proposed to update the code to split the > lemmatizer input by spaces instead of by tab. I believe tab is the desired > delimiter but we need to make sure the documentation is consistent. > Refer to > [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer|https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer.] > , in particular the following sentences: > "The training data consist of three columns separated by spaces. Each word > has been put on a separate line and there is an empty line after each > sentence. The first column contains the current word, the second its > part-of-speech tag and the third its lemma. Here is an example of the file > format:" > Determine if that first line should read "separated by tabs" instead. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
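The question raised in OPENNLP-1363 (tabs versus spaces) matters as soon as a column value may itself contain a space. The sketch below illustrates this under stated assumptions: the class `LemmaLineSketch`, the method `parseLine`, and the sample line with a multiword token are hypothetical, and the code makes no claim about which delimiter OpenNLP actually uses. It only shows that splitting on tabs preserves three columns (word, part-of-speech tag, lemma) where splitting on whitespace would not.

```java
class LemmaLineSketch {

    // Splits one training line into word, POS tag, and lemma,
    // assuming the columns are separated by tab characters.
    static String[] parseLine(String line) {
        String[] columns = line.split("\t");
        if (columns.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 tab-separated columns in line: '" + line + "'");
        }
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical line whose word and lemma contain a space.
        String line = "New York\tNNP\tNew York";

        // Tab splitting keeps the three columns intact.
        String[] byTab = parseLine(line);
        System.out.println(byTab.length + " columns, lemma = " + byTab[2]);

        // Whitespace splitting breaks the word and lemma apart.
        String[] byWhitespace = line.split("\\s+");
        System.out.println(byWhitespace.length + " columns when split on whitespace");
    }
}
```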
[jira] [Updated] (OPENNLP-1135) Remove support for OSGi
[ https://issues.apache.org/jira/browse/OPENNLP-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1135: Issue Type: Wish (was: Improvement) > Remove support for OSGi > --- > > Key: OPENNLP-1135 > URL: https://issues.apache.org/jira/browse/OPENNLP-1135 > Project: OpenNLP > Issue Type: Wish >Reporter: Jörn Kottmann >Priority: Minor > > Remove the OSGi bundle support from the opennlp-tools jar. OSGi isn't used > widely and the ones who are using it know how to use opennlp-tools in an OSGi > environment anyway by applying some build tricks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-705) integrate Similarity with VerbNet
[ https://issues.apache.org/jira/browse/OPENNLP-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-705: --- Issue Type: Wish (was: Bug) > integrate Similarity with VerbNet > - > > Key: OPENNLP-705 > URL: https://issues.apache.org/jira/browse/OPENNLP-705 > Project: OpenNLP > Issue Type: Wish >Reporter: Boris Galitsky >Assignee: Boris Galitsky >Priority: Major > > When matching two parse trees, features of the verbs other than POS needs to > be taken into account for more accurate matching -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-509) opennlp.tools.parser.Parse.getParent() returning incorrect object
[ https://issues.apache.org/jira/browse/OPENNLP-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-509: --- Issue Type: Bug (was: Improvement) > opennlp.tools.parser.Parse.getParent() returning incorrect object > - > > Key: OPENNLP-509 > URL: https://issues.apache.org/jira/browse/OPENNLP-509 > Project: OpenNLP > Issue Type: Bug > Components: Parser >Affects Versions: tools-1.5.2-incubating > Environment: OpenNLP invoked from C# (.Net 4) via IKVM ikvm-7.0.4335.0 > OpenNLP library dll created using the command: ikvmc -target:library > -assembly:opennlp %OPENNLP_LIB_PATH%\opennlp-maxent-3.0.2-incubating.jar > %OPENNLP_LIB_PATH%\jwnl-1.3.3.jar > %OPENNLP_LIB_PATH%\opennlp-tools-1.5.2-incubating.jar >Reporter: Ofer Tal >Priority: Major > > After parsing a sentence with opennlp.tools.parser.Parse.parse() some (many) > Parse children do not have the correct parent set. > Details: > given a Parse node in the tree (let's assume it is in a variable named p) > When iterating over the Parse[] returned by p.getChildren(), checking > p.equals(children[i].getParent()) returns false in many, if not all of the > nodes. 
> More background -- > to create the parse tree, I used the code: > {code} > opennlp.tools.parser.Parse p = new opennlp.tools.parser.Parse(parseSentence, > new opennlp.tools.util.Span(0, parseSentence.Length), > opennlp.tools.parser.AbstractBottomUpParser.INC_NODE, 1, null); > // create a parse object for each token and add it to the parent > int start = 0; > foreach (string token in tokenizedSentence) > { > { > opennlp.tools.parser.Parse tokenParse = new > opennlp.tools.parser.Parse(parseSentence, > new > opennlp.tools.util.Span(start, start + token.Length), > > opennlp.tools.parser.AbstractBottomUpParser.TOK_NODE, > 0, > 0); > > p.insert(tokenParse); > start += token.Length + 1; > } > } > // fetch 1 possible parse trees > opennlp.tools.parser.Parse[] parses = parser.parse(p, 1); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
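The parent/child consistency check described in the report can be sketched as follows. This is a minimal illustration on a hypothetical `Node` class, not on `opennlp.tools.parser.Parse`, and it compares by reference identity rather than `equals()`; both choices are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentCheckSketch {

    // Hypothetical minimal tree node; it is not opennlp.tools.parser.Parse.
    static class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        // Adds a child and keeps the back-reference in sync.
        void insert(Node child) {
            child.parent = this;
            children.add(child);
        }
    }

    // Walks the tree and collects every node whose parent reference
    // is not the node it was reached from.
    static List<Node> findInconsistent(Node node) {
        List<Node> bad = new ArrayList<>();
        for (Node child : node.children) {
            if (child.parent != node) {
                bad.add(child);
            }
            bad.addAll(findInconsistent(child));
        }
        return bad;
    }

    public static void main(String[] args) {
        Node root = new Node("TOP");
        Node np = new Node("NP");
        root.insert(np);
        Node tok = new Node("TK");
        np.children.add(tok); // deliberately skip setting tok.parent
        System.out.println(findInconsistent(root).size());
    }
}
```

A walk like this over a freshly parsed tree would list exactly the nodes the reporter describes, which makes it easier to see whether the mismatch affects all nodes or only some levels.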
[jira] [Commented] (OPENNLP-1187) Issue in finding accuracy of model
[ https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645353#comment-17645353 ] ASF GitHub Bot commented on OPENNLP-1187: - mawiesne opened a new pull request, #452: URL: https://github.com/apache/opennlp/pull/452 Change - - adds `NaiveBayesEvalParameters` by computation of the outcome total values, as proposed by reporter 'agarg98' Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. 
> Issue in finding accuracy of model > -- > > Key: OPENNLP-1187 > URL: https://issues.apache.org/jira/browse/OPENNLP-1187 > Project: OpenNLP > Issue Type: Bug > Components: Doccat, Machine Learning >Affects Versions: 1.8.4 >Reporter: Aman Garg >Priority: Major > Fix For: 2.1.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > The trainingStats function in the NaiveBayesTrainer class is not working properly > and displays wrong results. > In findParameters(), line 154, i.e. > EvalParameters evalParams = new EvalParameters(params, numOutcomes); > should be replaced by the following block: > > double[] outcomeTotals = new double[outcomeLabels.length]; > for (int i = 0; i < params.length; ++i) { > Context context = params[i]; > for (int j = 0; j < context.getOutcomes().length; ++j) { > int outcome = context.getOutcomes()[j]; > double count = context.getParameters()[j]; > outcomeTotals[outcome] += count; > } > } > evalParams = new NaiveBayesEvalParameters(params, > outcomeLabels.length, outcomeTotals, predLabels.length); -- This message was sent by Atlassian Jira (v8.20.10#820010)
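The accumulation step proposed in the report can be tried in isolation. The sketch below is self-contained and uses a hypothetical `SparseContext` class in place of OpenNLP's `Context`; it only illustrates how per-outcome totals are summed from sparse (outcome index, count) pairs and makes no claim about the final implementation in the pull request.

```java
import java.util.Arrays;

public class OutcomeTotalsSketch {

    // Hypothetical stand-in for a per-predicate context: parallel arrays of
    // outcome indices and the counts recorded for those outcomes.
    static class SparseContext {
        final int[] outcomes;
        final double[] parameters;

        SparseContext(int[] outcomes, double[] parameters) {
            this.outcomes = outcomes;
            this.parameters = parameters;
        }
    }

    // Sums the counts of every context into one total per outcome.
    static double[] outcomeTotals(SparseContext[] params, int numOutcomes) {
        double[] totals = new double[numOutcomes];
        for (SparseContext context : params) {
            for (int j = 0; j < context.outcomes.length; j++) {
                totals[context.outcomes[j]] += context.parameters[j];
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        SparseContext[] params = {
            new SparseContext(new int[] {0, 1}, new double[] {2.0, 1.0}),
            new SparseContext(new int[] {1}, new double[] {3.0})
        };
        System.out.println(Arrays.toString(outcomeTotals(params, 2)));
    }
}
```

For the two sample contexts, outcome 0 receives 2.0 and outcome 1 receives 1.0 + 3.0, so the totals are 2.0 and 4.0.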
[jira] [Assigned] (OPENNLP-1187) Issue in finding accuracy of model
[ https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1187: --- Assignee: Martin Wiesner > Issue in finding accuracy of model > -- > > Key: OPENNLP-1187 > URL: https://issues.apache.org/jira/browse/OPENNLP-1187 > Project: OpenNLP > Issue Type: Bug > Components: Doccat, Machine Learning >Affects Versions: 1.8.4 >Reporter: Aman Garg >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.1.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > The trainingStats function in the NaiveBayesTrainer class is not working properly > and displays wrong results. > In findParameters(), line 154, i.e. > EvalParameters evalParams = new EvalParameters(params, numOutcomes); > should be replaced by the following block: > > double[] outcomeTotals = new double[outcomeLabels.length]; > for (int i = 0; i < params.length; ++i) { > Context context = params[i]; > for (int j = 0; j < context.getOutcomes().length; ++j) { > int outcome = context.getOutcomes()[j]; > double count = context.getParameters()[j]; > outcomeTotals[outcome] += count; > } > } > evalParams = new NaiveBayesEvalParameters(params, > outcomeLabels.length, outcomeTotals, predLabels.length); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1187) Issue in finding accuracy of model
[ https://issues.apache.org/jira/browse/OPENNLP-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1187: Fix Version/s: 2.1.1 > Issue in finding accuracy of model > -- > > Key: OPENNLP-1187 > URL: https://issues.apache.org/jira/browse/OPENNLP-1187 > Project: OpenNLP > Issue Type: Bug > Components: Doccat, Machine Learning >Affects Versions: 1.8.4 >Reporter: Aman Garg >Priority: Major > Fix For: 2.1.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > The trainingStats function in the NaiveBayesTrainer class is not working properly > and displays wrong results. > In findParameters(), line 154, i.e. > EvalParameters evalParams = new EvalParameters(params, numOutcomes); > should be replaced by the following block: > > double[] outcomeTotals = new double[outcomeLabels.length]; > for (int i = 0; i < params.length; ++i) { > Context context = params[i]; > for (int j = 0; j < context.getOutcomes().length; ++j) { > int outcome = context.getOutcomes()[j]; > double count = context.getParameters()[j]; > outcomeTotals[outcome] += count; > } > } > evalParams = new NaiveBayesEvalParameters(params, > outcomeLabels.length, outcomeTotals, predLabels.length); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
[ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1182: Priority: Minor (was: Major) > LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise > --- > > Key: OPENNLP-1182 > URL: https://issues.apache.org/jira/browse/OPENNLP-1182 > Project: OpenNLP > Issue Type: Bug >Affects Versions: 1.8.4 >Reporter: Steven Rowe >Priority: Minor > > Contrary to the docs (see below), LanguageDetectorConverterTool doesn't > actually do anything at all; the class is empty. > {quote} > The following sequence of commands shows how to convert the Leipzig Corpora > collection at folder leipzig-train/ to the default Language Detector format, > by creating groups of 5 sentences as documents and limiting to 1 > documents per language. Then, it shuffles the result and selects the first > 10 lines as train corpus and the last 2 as evaluation corpus: > {noformat} > $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ > -sentencesPerSample 5 -samplesPerLanguage 1 > leipzig.txt > $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > > leipzig_shuf.txt > $ head -10 < leipzig_shuf.txt > leipzig.train > $ tail -2 < leipzig_shuf.txt > leipzig.eval > {noformat} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
[ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1182: Component/s: Language Detector > LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise > --- > > Key: OPENNLP-1182 > URL: https://issues.apache.org/jira/browse/OPENNLP-1182 > Project: OpenNLP > Issue Type: Bug > Components: Language Detector >Affects Versions: 1.8.4 >Reporter: Steven Rowe >Priority: Minor > > Contrary to the docs (see below), LanguageDetectorConverterTool doesn't > actually do anything at all; the class is empty. > {quote} > The following sequence of commands shows how to convert the Leipzig Corpora > collection at folder leipzig-train/ to the default Language Detector format, > by creating groups of 5 sentences as documents and limiting to 1 > documents per language. Then, it shuffles the result and selects the first > 10 lines as train corpus and the last 2 as evaluation corpus: > {noformat} > $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ > -sentencesPerSample 5 -samplesPerLanguage 1 > leipzig.txt > $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > > leipzig_shuf.txt > $ head -10 < leipzig_shuf.txt > leipzig.train > $ tail -2 < leipzig_shuf.txt > leipzig.eval > {noformat} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
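The shuffle-and-split part of the quoted pipeline (the perl, head and tail steps) can also be expressed in Java. The following is only a sketch under stated assumptions: it works on an in-memory list of lines, the class and method names are hypothetical, and the seed and split sizes are illustrative rather than taken from the documentation. It does not touch the converter tool itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleSplit {

    // Returns a shuffled copy of the input; a fixed seed keeps runs reproducible.
    static List<String> shuffled(List<String> lines, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    // First n lines, comparable to `head -n`.
    static List<String> head(List<String> lines, int n) {
        return new ArrayList<>(lines.subList(0, Math.min(n, lines.size())));
    }

    // Last n lines, comparable to `tail -n`.
    static List<String> tail(List<String> lines, int n) {
        return new ArrayList<>(lines.subList(Math.max(0, lines.size() - n), lines.size()));
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            corpus.add("lang\tsample " + i); // hypothetical sample lines
        }
        List<String> mixed = shuffled(corpus, 42L);
        System.out.println("train size: " + head(mixed, 10).size());
        System.out.println("eval size: " + tail(mixed, 2).size());
    }
}
```

As in the shell version, the training and evaluation sets stay disjoint only if the two sizes together do not exceed the number of shuffled lines.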
[jira] [Updated] (OPENNLP-1220) Add support for Byte Pair Encoding (BPE)
[ https://issues.apache.org/jira/browse/OPENNLP-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1220: Priority: Minor (was: Major) > Add support for Byte Pair Encoding (BPE) > > > Key: OPENNLP-1220 > URL: https://issues.apache.org/jira/browse/OPENNLP-1220 > Project: OpenNLP > Issue Type: Improvement >Reporter: Jörn Kottmann >Priority: Minor > > It would be nice to add support for BPE to OpenNLP: > [https://arxiv.org/pdf/1508.07909.pdf] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release
[ https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1381: Description: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP-411|https://openjdk.org/jeps/411] which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38) at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11) at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) was: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at
[jira] [Commented] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release
[ https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645340#comment-17645340 ] Richard Zowalla commented on OPENNLP-1381: -- We are testing for exit codes from System.exit(...) here. A workaround would be to use an approach similar to the one Picocli uses: https://github.com/remkop/picocli/issues/1503 , i.e. setting -Djava.security.manager=allow before running the tests (programmatically). An alternative would be to evaluate how we can test for exit codes _or_ migrate that kind of test to a (scripted) CI-based integration test on Jenkins or GitHub Actions. > OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The > Security Manager is deprecated and will be removed in a future release > --- > > Key: OPENNLP-1381 > URL: https://issues.apache.org/jira/browse/OPENNLP-1381 > Project: OpenNLP > Issue Type: Bug > Components: Command Line Interface >Affects Versions: 2.1.0 > Environment: MacOS Monterey 12.5 Intel (same issue in M1 chip) > brew-installed OpenJDK 18.0.2 >Reporter: Bertrand Rigaldies >Priority: Major > Fix For: 2.1.1 > > Attachments: Screen Shot 2022-07-31 at 9.10.48 PM.png, Screen Shot > 2022-07-31 at 9.11.09 PM.png > > > As of OpenJDK 18, the Security Manager has been deprecated (see > [JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in > CLITest.java: > java.lang.UnsupportedOperationException: The Security Manager is deprecated > and will be removed in a future release > at java.base/java.lang.System.setSecurityManager(System.java:416) > at > opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) > at > com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38) > at > com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11) > at > com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) > at > com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) > at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
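One way to "test for exit codes" without a Security Manager, as floated in the comment above, is to route the exit through an injectable handler. The sketch below is only an illustration of that idea: the `run` method, its argument handling, and the exit codes are hypothetical and do not reflect the actual OpenNLP CLI class or the fix that was eventually chosen.

```java
import java.util.function.IntConsumer;

public class ExitCodeSketch {

    // Hypothetical CLI entry point: instead of calling System.exit directly,
    // it reports the exit code to a handler supplied by the caller.
    static void run(String[] args, IntConsumer exitHandler) {
        if (args.length == 0) {
            exitHandler.accept(1); // assumed convention: missing tool name is an error
            return;
        }
        exitHandler.accept(0);
    }

    public static void main(String[] args) {
        // A test captures the code instead of terminating the JVM.
        // Production code would pass System::exit as the handler.
        int[] captured = {-1};
        run(new String[0], code -> captured[0] = code);
        System.out.println("captured exit code: " + captured[0]);
    }
}
```

The trade-off is that the production entry point has to be refactored once, whereas the `-Djava.security.manager=allow` workaround leaves the code untouched but depends on an API that JEP 411 deprecates for removal.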
[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release
[ https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1381: Description: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP-411|[https://openjdk.org/jeps/411]]) which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38) at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11) at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) was: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 411] ([https://openjdk.org/jeps/411])[,|https://openjdk.org/jeps/411)),] which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at
[jira] [Updated] (OPENNLP-1381) OpenJDK 18+: CLITest fails with java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release
[ https://issues.apache.org/jira/browse/OPENNLP-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1381: Description: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 411] ([https://openjdk.org/jeps/411])[,|https://openjdk.org/jeps/411)),] which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38) at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11) at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) was: As of OpenJDK 18, the Security Manager has been deprecated (see [JEP 411]([https://openjdk.org/jeps/411)),] which fails all tests in CLITest.java: java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release at java.base/java.lang.System.setSecurityManager(System.java:416) at opennlp.tools.cmdline.CLITest.installNoExitSecurityManager(CLITest.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at
[jira] [Resolved] (OPENNLP-217) Add Detokenizer Dictionary section
[ https://issues.apache.org/jira/browse/OPENNLP-217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-217. Fix Version/s: 1.9.4 Resolution: Fixed I checked this for versions 2.1.0, 2.0.0 and 1.9.4; see, for instance: [https://opennlp.apache.org/docs/2.1.0/manual/opennlp.html#tools.tokenizer.detokenizing.dict] Resolving, as the requested documentation has been available since version 1.9.4. > Add Detokenizer Dictionary section > -- > > Key: OPENNLP-217 > URL: https://issues.apache.org/jira/browse/OPENNLP-217 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Reporter: Jörn Kottmann >Priority: Major > Labels: help-wanted > Fix For: 1.9.4 > > > The documentation is lacking a section about the detokenizer dictionary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package
[ https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645270#comment-17645270 ] ASF GitHub Bot commented on OPENNLP-1406: - mawiesne commented on code in PR #449: URL: https://github.com/apache/opennlp/pull/449#discussion_r1044397338 ## opennlp-tools/src/main/java/opennlp/tools/parser/ParserModel.java: ## @@ -41,16 +41,15 @@ import opennlp.tools.util.model.POSModelSerializer; /** - * This is an abstract base class for {@link ParserModel} implementations. + * This is the default {@link ParserModel} implementation. */ -// TODO: Model should validate the artifact map Review Comment: No, as this is done in `ParserModel`. That's why I removed the orphaned TODO here. > Enhance JavaDoc in opennlp.tools.parser package > --- > > Key: OPENNLP-1406 > URL: https://issues.apache.org/jira/browse/OPENNLP-1406 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation, Parser >Affects Versions: 2.1.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.parser_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need correcting. > It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1408) Enhance JavaDoc in opennlp.tools.doccat package
[ https://issues.apache.org/jira/browse/OPENNLP-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645262#comment-17645262 ] ASF GitHub Bot commented on OPENNLP-1408: - mawiesne opened a new pull request, #451: URL: https://github.com/apache/opennlp/pull/451 Change - - adds missing JavaDoc - improves existing documentation for clarity - removes superfluous text - adds 'final' modifier where useful and applicable - adds 'Override' annotation where useful and applicable - fixes several typos Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? 
### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. > Enhance JavaDoc in opennlp.tools.doccat package > --- > > Key: OPENNLP-1408 > URL: https://issues.apache.org/jira/browse/OPENNLP-1408 > Project: OpenNLP > Issue Type: Improvement > Components: Doccat, Documentation >Affects Versions: 2.1.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.doccat_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need sanitizing. > It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training
[ https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick closed OPENNLP-1218. -- > All binary implementation of AbstractModelWriter and DataReader throws > java.io.UTFDataFormatException when large dataset is used for training > -- > > Key: OPENNLP-1218 > URL: https://issues.apache.org/jira/browse/OPENNLP-1218 > Project: OpenNLP > Issue Type: Bug >Affects Versions: 1.8.4 >Reporter: Sudheer Prem >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.1.0 > > > All binary implementation of AbstractModelWriter and DataReader throws > "java.io.UTFDataFormatException: encoded string too long" in the > java.io.DataOutputStream.writeUTF method call when a large dataset (more than > 64 KB) is used for training. Looks like, this is a known limitation of > java.io.DataOutputStream.writeUTF method. > Following is the stack trace: > java.io.UTFDataFormatException: encoded string too long: 97519 bytes > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364) > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323) > at > opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67) > at > opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169) > at > opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75) > at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29) > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597) > > The implementation should use byte array to resolve this issue. > Following is the fix to resolve this issue. 
> > public void writeUTF(String s) throws java.io.IOException { > byte[] ctxByte = s.getBytes("utf-8"); > output.writeInt(ctxByte.length); > output.write(ctxByte); > //output.writeUTF(s); > } > -- This message was sent by Atlassian Jira (v8.20.10#820010)
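The fix proposed in the issue above replaces `DataOutputStream.writeUTF`, which rejects strings whose encoded form exceeds 65535 bytes, with an explicit length prefix followed by the raw UTF-8 bytes. A minimal, self-contained sketch of that idea follows; the class and method names (`LengthPrefixedStrings`, `writeLongString`, `readLongString`, `roundTrip`) are hypothetical and not taken from OpenNLP, and the matching reader is an assumption about how the written data would be read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: illustrates length-prefixed string I/O.
final class LengthPrefixedStrings {

  // Writes a 4-byte length followed by the UTF-8 bytes of the string.
  // Unlike writeUTF, this is not limited to 65535 encoded bytes.
  static void writeLongString(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Reads a string written by writeLongString: the length first, then exactly that many bytes.
  static String readLongString(DataInputStream in) throws IOException {
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Writes the string to an in-memory buffer and reads it back.
  static String roundTrip(String s) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        writeLongString(out, s);
      }
      try (DataInputStream in =
               new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        return readLongString(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Same size as the string in the reported stack trace (97519 bytes).
    char[] chars = new char[97_519];
    Arrays.fill(chars, 'x');
    String large = new String(chars);
    System.out.println(roundTrip(large).equals(large)); // prints "true"
  }
}
```

Note that data written this way cannot be read with `DataInputStream.readUTF`, which expects a 2-byte length and modified UTF-8, so writer and reader have to change together.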
[jira] [Resolved] (OPENNLP-216) Add Detokenizer API section
[ https://issues.apache.org/jira/browse/OPENNLP-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-216. Fix Version/s: 1.9.4 Assignee: Martin Wiesner Resolution: Fixed This was handled via PR https://github.com/apache/opennlp/pull/388. The fix version is 1.9.4, checked via [https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.tokenizer.detokenizing.api] Resolving as fixed. > Add Detokenizer API section > --- > > Key: OPENNLP-216 > URL: https://issues.apache.org/jira/browse/OPENNLP-216 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Reporter: Jörn Kottmann >Assignee: Martin Wiesner >Priority: Major > Labels: help-wanted, pull-request-available > Fix For: 1.9.4 > > > The documentation is lacking a section about the detokenizer API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package
[ https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645242#comment-17645242 ] ASF GitHub Bot commented on OPENNLP-1406: - rzo1 commented on code in PR #449: URL: https://github.com/apache/opennlp/pull/449#discussion_r1044334326 ## opennlp-tools/src/main/java/opennlp/tools/parser/ParserEventTypeEnum.java: ## @@ -19,13 +19,14 @@ package opennlp.tools.parser; /** - * Enumerated type of event types for the parser. + * Enumeration of event types for a {@link Parser}. */ public enum ParserEventTypeEnum { BUILD, CHECK, + // TODO Add reason why those enum values are deprecated Review Comment: +1 ## opennlp-tools/src/main/java/opennlp/tools/parser/ParserModel.java: ## @@ -41,16 +41,15 @@ import opennlp.tools.util.model.POSModelSerializer; /** - * This is an abstract base class for {@link ParserModel} implementations. + * This is the default {@link ParserModel} implementation. */ -// TODO: Model should validate the artifact map Review Comment: Is this still a valid todo? > Enhance JavaDoc in opennlp.tools.parser package > --- > > Key: OPENNLP-1406 > URL: https://issues.apache.org/jira/browse/OPENNLP-1406 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation, Parser >Affects Versions: 2.1.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.parser_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need sanitizing. > It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-216) Add Detokenizer API section
[ https://issues.apache.org/jira/browse/OPENNLP-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-216: --- Labels: help-wanted pull-request-available (was: help-wanted) > Add Detokenizer API section > --- > > Key: OPENNLP-216 > URL: https://issues.apache.org/jira/browse/OPENNLP-216 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Reporter: Jörn Kottmann >Priority: Major > Labels: help-wanted, pull-request-available > > The documentation is lacking a section about the detokenizer API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1233) Penn Treebank tag set link insufficient
[ https://issues.apache.org/jira/browse/OPENNLP-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645239#comment-17645239 ] Martin Wiesner commented on OPENNLP-1233: - [~mvsagar] The URL (https://www.clips.uantwerpen.be/pages/mbsp-tags) you mentioned in the description is no longer working. Can you provide an alternative for this? > Penn Treebank tag set link insufficient > --- > > Key: OPENNLP-1233 > URL: https://issues.apache.org/jira/browse/OPENNLP-1233 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.9.1 >Reporter: Vidyasagar Mundroy >Priority: Trivial > > The link > "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" > provided with name "Penn Treebank tag set" in sections Tagging and Chunking > of manual does not list tags used for chunking. A proper link is needed for > anybody to develop applications using OpenNLP. Found some info on missing > tags at "https://www.clips.uantwerpen.be/pages/mbsp-tags" but not sure if > the link can be used in the manual. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1237) Document Categorizer example references non-existant method signature
[ https://issues.apache.org/jira/browse/OPENNLP-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1237. - Fix Version/s: 2.0.0 Resolution: Fixed This was resolved with commit 6c439ae9 by [~alanwang926] in OPENNLP-1319. Documentation example is (now) consistent since version 2.0.0. Resolving as (indirectly) fixed. > Document Categorizer example references non-existant method signature > - > > Key: OPENNLP-1237 > URL: https://issues.apache.org/jira/browse/OPENNLP-1237 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.9.1 >Reporter: Nick Burch >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.0.0 > > > In the Document Categorizer section of the manual > https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat there > is a code snippet in the training section: > {code} > model = DocumentCategorizerME.train("en", sampleStream); > {code} > However, no matching method is present in the javadocs at > https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html > . The nearest seems to be one that takes two additional required parameters: > https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html#train-java.lang.String-opennlp.tools.util.ObjectStream-opennlp.tools.util.TrainingParameters-opennlp.tools.doccat.DoccatFactory- > It looks like the code snippet is out of date, and needs updating to cover > the API changes that seem to have happened -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1237) Document Categorizer example references non-existant method signature
[ https://issues.apache.org/jira/browse/OPENNLP-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1237: --- Assignee: Martin Wiesner > Document Categorizer example references non-existant method signature > - > > Key: OPENNLP-1237 > URL: https://issues.apache.org/jira/browse/OPENNLP-1237 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.9.1 >Reporter: Nick Burch >Assignee: Martin Wiesner >Priority: Major > > In the Document Categorizer section of the manual > https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat there > is a code snippet in the training section: > {code} > model = DocumentCategorizerME.train("en", sampleStream); > {code} > However, no matching method is present in the javadocs at > https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html > . The nearest seems to be one that takes two additional required parameters: > https://opennlp.apache.org/docs/1.9.1/apidocs/opennlp-tools/opennlp/tools/doccat/DocumentCategorizerME.html#train-java.lang.String-opennlp.tools.util.ObjectStream-opennlp.tools.util.TrainingParameters-opennlp.tools.doccat.DoccatFactory- > It looks like the code snippet is out of date, and needs updating to cover > the API changes that seem to have happened -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training
[ https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627227#comment-17627227 ] Martin Wiesner edited comment on OPENNLP-1218 at 12/9/22 10:46 AM: --- This is an open duplicate of #OPENNLP-1366 which got resolved recently. The fix(es) will be released with OpenNLP version 2.1.0. [~jzemerick]: Could you mark this one as resolved / duplicate with the above fix version? I applied the fix in #OPENNLP-1366 also for _BinaryNaiveBayesModelWriter_ as reported by [~sudheerprem] for his case. Details see: [https://github.com/apache/opennlp/pull/427] was (Author: mawiesne): This is an open duplicate of #OPENNLP-1366 which got resolved recently. The fix(es) will be released with OpenNLP version 2.0.1. [~jzemerick]: Could you mark this one as resolved / duplicate with the above fix version? I applied the fix in #OPENNLP-1366 also for _BinaryNaiveBayesModelWriter_ as reported by [~sudheerprem] for his case. Details see: [https://github.com/apache/opennlp/pull/427] > All binary implementation of AbstractModelWriter and DataReader throws > java.io.UTFDataFormatException when large dataset is used for training > -- > > Key: OPENNLP-1218 > URL: https://issues.apache.org/jira/browse/OPENNLP-1218 > Project: OpenNLP > Issue Type: Bug >Affects Versions: 1.8.4 >Reporter: Sudheer Prem >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.1.0 > > > All binary implementation of AbstractModelWriter and DataReader throws > "java.io.UTFDataFormatException: encoded string too long" in the > java.io.DataOutputStream.writeUTF method call when a large dataset (more than > 64 KB) is used for training. Looks like, this is a known limitation of > java.io.DataOutputStream.writeUTF method. 
> Following is the stack trace: > java.io.UTFDataFormatException: encoded string too long: 97519 bytes > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364) > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323) > at > opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67) > at > opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169) > at > opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75) > at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29) > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597) > > The implementation should use byte array to resolve this issue. > Following is the fix to resolve this issue. > > public void writeUTF(String s) throws java.io.IOException { > byte[] ctxByte = s.getBytes("utf-8"); > output.writeInt(ctxByte.length); > output.write(ctxByte); > //output.writeUTF(s); > } > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training
[ https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1218. - Fix Version/s: 2.1.0 Assignee: Martin Wiesner Resolution: Duplicate > All binary implementation of AbstractModelWriter and DataReader throws > java.io.UTFDataFormatException when large dataset is used for training > -- > > Key: OPENNLP-1218 > URL: https://issues.apache.org/jira/browse/OPENNLP-1218 > Project: OpenNLP > Issue Type: Bug >Affects Versions: 1.8.4 >Reporter: Sudheer Prem >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.1.0 > > > All binary implementation of AbstractModelWriter and DataReader throws > "java.io.UTFDataFormatException: encoded string too long" in the > java.io.DataOutputStream.writeUTF method call when a large dataset (more than > 64 KB) is used for training. Looks like, this is a known limitation of > java.io.DataOutputStream.writeUTF method. > Following is the stack trace: > java.io.UTFDataFormatException: encoded string too long: 97519 bytes > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364) > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323) > at > opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67) > at > opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169) > at > opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75) > at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29) > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597) > > The implementation should use byte array to resolve this issue. > Following is the fix to resolve this issue. 
> > public void writeUTF(String s) throws java.io.IOException { > byte[] ctxByte = s.getBytes("utf-8"); > output.writeInt(ctxByte.length); > output.write(ctxByte); > //output.writeUTF(s); > } > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (OPENNLP-1395) Release OpenNLP 2.1.0
[ https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reopened OPENNLP-1395: - > Release OpenNLP 2.1.0 > - > > Key: OPENNLP-1395 > URL: https://issues.apache.org/jira/browse/OPENNLP-1395 > Project: OpenNLP > Issue Type: Task > Components: Build, Packaging and Test >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick >Priority: Major > Fix For: 2.0.1, 2.1.0 > > > This ticket is to track the work to release OpenNLP 2.1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1395) Release OpenNLP 2.1.0
[ https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1395. - Resolution: Fixed > Release OpenNLP 2.1.0 > - > > Key: OPENNLP-1395 > URL: https://issues.apache.org/jira/browse/OPENNLP-1395 > Project: OpenNLP > Issue Type: Task > Components: Build, Packaging and Test >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick >Priority: Major > Fix For: 2.0.1, 2.1.0 > > > This ticket is to track the work to release OpenNLP 2.1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (OPENNLP-1395) Release OpenNLP 2.1.0
[ https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner closed OPENNLP-1395. --- OpenNLP version 2.1.0 has been out for some days now, see: https://opennlp.apache.org/download.html. Closing issue as resolved. > Release OpenNLP 2.1.0 > - > > Key: OPENNLP-1395 > URL: https://issues.apache.org/jira/browse/OPENNLP-1395 > Project: OpenNLP > Issue Type: Task > Components: Build, Packaging and Test >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick >Priority: Major > Fix For: 2.0.1, 2.1.0 > > > This ticket is to track the work to release OpenNLP 2.1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OPENNLP-1395) Release OpenNLP 2.1.0
[ https://issues.apache.org/jira/browse/OPENNLP-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner resolved OPENNLP-1395. - Fix Version/s: 2.1.0 Resolution: Done > Release OpenNLP 2.1.0 > - > > Key: OPENNLP-1395 > URL: https://issues.apache.org/jira/browse/OPENNLP-1395 > Project: OpenNLP > Issue Type: Task > Components: Build, Packaging and Test >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick >Priority: Major > Fix For: 2.0.1, 2.1.0 > > > This ticket is to track the work to release OpenNLP 2.1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)
[ https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner updated OPENNLP-1307: Fix Version/s: 2.1.1 > Incorrect code example for Document Categorization (9.3) > > > Key: OPENNLP-1307 > URL: https://issues.apache.org/jira/browse/OPENNLP-1307 > Project: OpenNLP > Issue Type: Documentation > Components: Doccat >Affects Versions: 1.9.3 > Environment: N/A >Reporter: John Slocum >Assignee: Martin Wiesner >Priority: Major > Labels: DocumentCategorizerME, documentation > Fix For: 2.1.1 > > Original Estimate: 2m > Remaining Estimate: 2m > > In > [https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api], > the code example feeds a String into DocumentCategorizerME.categorize(). The > method itself takes an array. I flagged priority as Major because this was a > killer - obviously it's a self-documenting bug when you run it, but I made > the mistake of assuming that the array actually needed would be an array of > documents - instead it needs to be an array of tokens from a single document, > i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting > with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) > before figuring this one out. > > Current (wrong) version: > > {code:java} > String inputText = ... > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText); > String category = myCategorizer.getBestCategory(outcomes); > {code} > > Should be more like: > > {code:java} > String inputText = ... // sanitized document to be categorized > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText.split(" ")); > String category = myCategorizer.getBestCategory(outcomes); > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
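The issue above boils down to passing `categorize` an array of tokens from a single document rather than the document string itself. As a minimal sketch using only the Java standard library, the following shows how a plain `split(" ")` differs from splitting on whitespace runs; the class name `WhitespaceTokens` and its `tokenize` helper are hypothetical, and whether stray empty tokens matter for a given model is an assumption, not something stated in the issue.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: turns one document into whitespace-separated tokens.
final class WhitespaceTokens {

  // Splits on runs of any whitespace and never returns empty tokens.
  static String[] tokenize(String document) {
    String trimmed = document.trim();
    if (trimmed.isEmpty()) {
      return new String[0];
    }
    return trimmed.split("\\s+");
  }

  public static void main(String[] args) {
    String inputText = "The  quick\tbrown fox ";

    // Splitting on a single space keeps an empty token and leaves the tab in place.
    System.out.println(Arrays.toString(inputText.split(" ")));
    // prints "[The, , quick	brown, fox]" (the third token still contains a tab)

    // Splitting on whitespace runs yields clean tokens.
    System.out.println(Arrays.toString(tokenize(inputText)));
    // prints "[The, quick, brown, fox]"
  }
}
```

The resulting array is the kind of input the reporter describes for `DocumentCategorizerME.categorize(..)`; a tokenizer from the `opennlp.tools.tokenize` package would be another option for this step.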
[jira] [Assigned] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)
[ https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1307: --- Assignee: Martin Wiesner > Incorrect code example for Document Categorization (9.3) > > > Key: OPENNLP-1307 > URL: https://issues.apache.org/jira/browse/OPENNLP-1307 > Project: OpenNLP > Issue Type: Documentation > Components: Doccat >Affects Versions: 1.9.3 > Environment: N/A >Reporter: John Slocum >Assignee: Martin Wiesner >Priority: Major > Labels: DocumentCategorizerME, documentation > Original Estimate: 2m > Remaining Estimate: 2m > > In > [https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api], > the code example feeds a String into DocumentCategorizerME.categorize(). The > method itself takes an array. I flagged priority as Major because this was a > killer - obviously it's a self-documenting bug when you run it, but I made > the mistake of assuming that the array actually needed would be an array of > documents - instead it needs to be an array of tokens from a single document, > i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting > with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) > before figuring this one out. > > Current (wrong) version: > > {code:java} > String inputText = ... > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText); > String category = myCategorizer.getBestCategory(outcomes); > {code} > > Should be more like: > > {code:java} > String inputText = ... // sanitized document to be categorized > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText.split(" ")); > String category = myCategorizer.getBestCategory(outcomes); > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1405) Enhance JavaDoc in opennlp.tools.tokenize package
[ https://issues.apache.org/jira/browse/OPENNLP-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1405: --- Assignee: Martin Wiesner > Enhance JavaDoc in opennlp.tools.tokenize package > - > > Key: OPENNLP-1405 > URL: https://issues.apache.org/jira/browse/OPENNLP-1405 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation, Tokenizer >Affects Versions: 2.0.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.tokenize_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need sanitizing. > It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package
[ https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1406: --- Assignee: Martin Wiesner > Enhance JavaDoc in opennlp.tools.parser package > --- > > Key: OPENNLP-1406 > URL: https://issues.apache.org/jira/browse/OPENNLP-1406 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation, Parser >Affects Versions: 2.1.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.parser_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need sanitizing. > It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OPENNLP-1408) Enhance JavaDoc in opennlp.tools.doccat package
Martin Wiesner created OPENNLP-1408: --- Summary: Enhance JavaDoc in opennlp.tools.doccat package Key: OPENNLP-1408 URL: https://issues.apache.org/jira/browse/OPENNLP-1408 Project: OpenNLP Issue Type: Improvement Components: Doccat, Documentation Affects Versions: 2.1.0 Reporter: Martin Wiesner Assignee: Martin Wiesner Fix For: 2.1.1 The JavaDoc of the _opennlp.tools.doccat_ package suffers from several inconsistencies and missing descriptions. Moreover, several typos are present that need sanitizing. It needs enhancements and/or additions to provide more clarity for readers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (OPENNLP-1404) Enhance JavaDoc in opennlp.tools.postag package
[ https://issues.apache.org/jira/browse/OPENNLP-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Wiesner reassigned OPENNLP-1404: --- Assignee: Martin Wiesner > Enhance JavaDoc in opennlp.tools.postag package > --- > > Key: OPENNLP-1404 > URL: https://issues.apache.org/jira/browse/OPENNLP-1404 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation, POS Tagger >Affects Versions: 2.1.0 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.1.1 > > > The JavaDoc of the _opennlp.tools.postag_ package suffers from several > inconsistencies and missing descriptions. Moreover, several typos are present > that need sanitizing. > It needs enhancements and/or additions to provide more clarity for readers of > this part of the OpenNLP API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1307) Incorrect code example for Document Categorization (9.3)
[ https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645226#comment-17645226 ] ASF GitHub Bot commented on OPENNLP-1307: - mawiesne opened a new pull request, #450: URL: https://github.com/apache/opennlp/pull/450 Change - - fixes `doccat.xml` as proposed in OPENNLP-1307; the current example in the DOC was inconsistent with `DocumentCategorizerME#categorize(..)` and required a change Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. 
> Incorrect code example for Document Categorization (9.3) > > > Key: OPENNLP-1307 > URL: https://issues.apache.org/jira/browse/OPENNLP-1307 > Project: OpenNLP > Issue Type: Documentation > Components: Doccat >Affects Versions: 1.9.3 > Environment: N/A >Reporter: John Slocum >Priority: Major > Labels: DocumentCategorizerME, documentation > Original Estimate: 2m > Remaining Estimate: 2m > > In > [https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api], > the code example feeds a String into DocumentCategorizerME.categorize(). The > method itself takes an array. I flagged priority as Major because this was a > killer - obviously it's a self-documenting bug when you run it, but I made > the mistake of assuming that the array actually needed would be an array of > documents - instead it needs to be an array of tokens from a single document, > i.e. one needs to split() the doc on whitespace. Lost 24 hours experimenting > with algos (maxent vs. naive_bayes) and params (cutoff, iterations, etc) > before figuring this one out. > > Current (wrong) version: > > {code:java} > String inputText = ... > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText); > String category = myCategorizer.getBestCategory(outcomes); > {code} > > Should be more like: > > {code:java} > String inputText = ... // sanitized document to be categorized > DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m); > double[] outcomes = myCategorizer.categorize(inputText.split(" ")); > String category = myCategorizer.getBestCategory(outcomes); > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1406) Enhance JavaDoc in opennlp.tools.parser package
[ https://issues.apache.org/jira/browse/OPENNLP-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645203#comment-17645203 ]

ASF GitHub Bot commented on OPENNLP-1406:

mawiesne opened a new pull request, #449:
URL: https://github.com/apache/opennlp/pull/449

Change
------
- adds missing JavaDoc
- improves existing documentation for clarity
- removes superfluous text
- removes orphaned (commented) code fragments in `..parser.treeinsert.ParserEventStream`
- fixes a missing variable assignment in `ParserCrossValidator` (a hidden bug)
- adds 'final' modifier where useful and applicable
- adds 'Override' annotation where useful and applicable
- fixes several typos
- corrects some inconsistencies in the `opennlp.tools.chunker` and `opennlp.tools.langdetect` packages

Tasks
-----
Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

> Enhance JavaDoc in opennlp.tools.parser package
>
> Key: OPENNLP-1406
> URL: https://issues.apache.org/jira/browse/OPENNLP-1406
> Project: OpenNLP
> Issue Type: Improvement
> Components: Documentation, Parser
> Affects Versions: 2.1.0
> Reporter: Martin Wiesner
> Priority: Minor
> Fix For: 2.1.1
>
> The JavaDoc of the _opennlp.tools.parser_ package suffers from several inconsistencies and missing descriptions. Moreover, several typos are present that need sanitizing.
> It needs enhancements and/or additions to provide more clarity for readers.