[jira] [Commented] (OPENNLP-1173) Prepare the 1.8.4 release

2017-12-20 Thread Jeff Zemerick (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299049#comment-16299049
 ] 

Jeff Zemerick commented on OPENNLP-1173:


Will base off commit c2f1b685abecfc11de76ffd0a28771f41b566782

> Prepare the 1.8.4 release
> -
>
> Key: OPENNLP-1173
> URL: https://issues.apache.org/jira/browse/OPENNLP-1173
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Jeff Zemerick
>Assignee: Jeff Zemerick
>
> This is a task to track the 1.8.4 release of OpenNLP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (OPENNLP-1173) Prepare the 1.8.4 release

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-1173:
---
Description: This is a task to track the 1.8.4 release of OpenNLP.  (was: 
Prepare the 1.8.4 release.)



[jira] [Updated] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-1082:
---
Fix Version/s: (was: 1.8.4)
   1.8.5

> SentenceSampleStream should add EOS to samples if missing
> -
>
> Key: OPENNLP-1082
> URL: https://issues.apache.org/jira/browse/OPENNLP-1082
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector
>Reporter: William Colen
>Assignee: William Colen
> Fix For: 1.8.5
>
>






[jira] [Updated] (OPENNLP-936) Add thread safe versions of some tools (ME sentence detection, tokenization, pos tagging)

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-936:
--
Fix Version/s: (was: 1.8.4)
   1.8.5

> Add thread safe versions of some tools (ME sentence detection, tokenization, 
> pos tagging)
> -
>
> Key: OPENNLP-936
> URL: https://issues.apache.org/jira/browse/OPENNLP-936
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: POS Tagger
>Affects Versions: 1.7.1
>Reporter: Thilo Goetz
>Priority: Minor
> Fix For: 1.8.5
>
>
> As discussed on the mailing list, add thread safe versions of maximum entropy 
> sentence detection, tokenization and pos tagging.
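The issue does not say how the thread-safe versions should be built. One common approach, sketched here purely as an assumption, is to give each thread its own instance of the non-thread-safe tool through a `ThreadLocal`. `PerThreadTool` and `getOnNewThread` are hypothetical names for illustration, not OpenNLP API.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical wrapper: hands every thread its own instance of a non-thread-safe tool. */
class PerThreadTool<T> {

  private final ThreadLocal<T> local;

  PerThreadTool(Supplier<T> factory) {
    // The factory runs once per thread, on that thread's first call to get().
    this.local = ThreadLocal.withInitial(factory);
  }

  /** Returns the instance that belongs to the calling thread. */
  T get() {
    return local.get();
  }

  /** Demonstration helper: returns the instance that a freshly started thread sees. */
  T getOnNewThread() {
    AtomicReference<T> seen = new AtomicReference<>();
    Thread worker = new Thread(() -> seen.set(get()));
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return seen.get();
  }

  public static void main(String[] args) {
    PerThreadTool<StringBuilder> tool = new PerThreadTool<>(StringBuilder::new);
    System.out.println(tool.get() == tool.get());            // same thread, same instance
    System.out.println(tool.get() == tool.getOnNewThread()); // other thread, other instance
  }
}
```

The trade-off of this design is memory: every thread that touches the wrapper keeps its own copy of the tool's mutable state, while the (immutable) model can still be shared.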





[jira] [Updated] (OPENNLP-1150) TokenNameFinderTrainerTool should use ModelUtil.createDefaultTrainingParameters() when mlParams is null

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-1150:
---
Fix Version/s: (was: 1.8.4)
   1.8.5

> TokenNameFinderTrainerTool should use 
> ModelUtil.createDefaultTrainingParameters() when mlParams is null
> ---
>
> Key: OPENNLP-1150
> URL: https://issues.apache.org/jira/browse/OPENNLP-1150
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 1.8.2
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 1.8.5
>
>
> Unlike other TrainerTools, TokenNameFinderTrainerTool creates an empty 
> TrainingParameters when mlParams is null by calling the constructor. 
> TokenNameFinderTrainerTool should use 
> ModelUtil.createDefaultTrainingParameters(), as the other TrainerTools do, to 
> initialize mlParams.
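The difference between the two null-handling strategies can be sketched as follows. This is a standalone illustration, not the OpenNLP source: `Params` and `createDefaults()` are hypothetical stand-ins for `TrainingParameters` and `ModelUtil.createDefaultTrainingParameters()`, and the default values are placeholders chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for TrainingParameters: a thin wrapper around a map. */
class Params {
  final Map<String, String> settings = new HashMap<>();
}

class TrainerDefaults {

  /** Hypothetical stand-in for ModelUtil.createDefaultTrainingParameters(); values are placeholders. */
  static Params createDefaults() {
    Params p = new Params();
    p.settings.put("Algorithm", "MAXENT");
    p.settings.put("Iterations", "100");
    p.settings.put("Cutoff", "5");
    return p;
  }

  /** Behaviour the issue criticises: a null argument yields an empty parameter set. */
  static Params emptyWhenNull(Params mlParams) {
    return mlParams != null ? mlParams : new Params();
  }

  /** Behaviour the issue asks for: a null argument yields populated defaults. */
  static Params defaultsWhenNull(Params mlParams) {
    return mlParams != null ? mlParams : createDefaults();
  }

  public static void main(String[] args) {
    System.out.println(emptyWhenNull(null).settings);    // no settings at all
    System.out.println(defaultsWhenNull(null).settings); // the placeholder defaults
  }
}
```

With the empty variant, any later code that reads a setting has to cope with it being absent; with the defaulting variant, every tool starts from the same baseline.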





[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-47:
-
Fix Version/s: (was: 1.8.4)
   1.8.5

> Rewrite the CONLL06 documentation based on the tutorial
> ---
>
> Key: OPENNLP-47
> URL: https://issues.apache.org/jira/browse/OPENNLP-47
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: tools-1.5.1-incubating
>Reporter: Joern Kottmann
>  Labels: help-wanted
> Fix For: 1.8.5
>
>
> The CONLL06 documentation should be rewritten to reflect the new converters
> which have been added to OpenNLP since it was first written.





[jira] [Updated] (OPENNLP-1140) Add 20 newsgroups format support

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-1140:
---
Fix Version/s: (was: 1.8.4)
   1.8.5

> Add 20 newsgroups format support
> 
>
> Key: OPENNLP-1140
> URL: https://issues.apache.org/jira/browse/OPENNLP-1140
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Reporter: Tommaso Teofili
> Fix For: 1.8.5
>
>
> It'd be nice to have support for [20 
> newsgroups|http://qwone.com/~jason/20Newsgroups/] format, especially for 
> evaluating {{DocCat}} models.





[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-47:
-
Fix Version/s: (was: 1.8.4)

> Rewrite the CONLL06 documentation based on the tutorial
> ---
>
> Key: OPENNLP-47
> URL: https://issues.apache.org/jira/browse/OPENNLP-47
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: tools-1.5.1-incubating
>Reporter: Joern Kottmann
>  Labels: help-wanted
> Fix For: 1.8.4
>
>
> The CONLL06 documentation should be rewritten to reflect the new converters
> which have been added to OpenNLP since it was first written.





[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick updated OPENNLP-47:
-
Fix Version/s: 1.8.4

> Rewrite the CONLL06 documentation based on the tutorial
> ---
>
> Key: OPENNLP-47
> URL: https://issues.apache.org/jira/browse/OPENNLP-47
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: tools-1.5.1-incubating
>Reporter: Joern Kottmann
>  Labels: help-wanted
> Fix For: 1.8.4
>
>
> The CONLL06 documentation should be rewritten to reflect the new converters
> which have been added to OpenNLP since it was first written.





[jira] [Closed] (OPENNLP-1171) some tests create temp files and directories but never delete them

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick closed OPENNLP-1171.
--

> some tests create temp files and directories but never delete them
> --
>
> Key: OPENNLP-1171
> URL: https://issues.apache.org/jira/browse/OPENNLP-1171
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Build, Packaging and Test
>Affects Versions: 1.8.3
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 1.8.4
>
>
> Some tests create temporary files and directories that are never deleted, 
> so the number of temporary files/directories grows every time 
> mvn clean test is run.
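A minimal sketch of the cleanup pattern such tests could follow, using only `java.nio.file`. The class and method names (`TempDirCleanup`, `deleteRecursively`, `createUseAndDelete`) and the `opennlp-test` prefix are made up for the example; the actual fix in the OpenNLP test suite may look different.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class TempDirCleanup {

  /** Deletes a directory tree; reverse path order visits children before their parents. */
  static void deleteRecursively(Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a temp directory, writes one file into it, and removes everything in finally. */
  static Path createUseAndDelete() {
    try {
      Path dir = Files.createTempDirectory("opennlp-test");
      try {
        Files.write(dir.resolve("events.txt"), "sample".getBytes(StandardCharsets.UTF_8));
      } finally {
        // Runs even if the body above throws, so nothing is left behind.
        deleteRecursively(dir);
      }
      return dir;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path dir = createUseAndDelete();
    System.out.println(dir + " still exists: " + Files.exists(dir));
  }
}
```

Explicit deletion in `finally` (or in a test teardown method) is more dependable than `File.deleteOnExit()`, which only runs on normal JVM shutdown and cannot remove non-empty directories.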





[jira] [Closed] (OPENNLP-1169) WordVectorTable should reference WVs by String

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick closed OPENNLP-1169.
--

> WordVectorTable should reference WVs by String
> --
>
> Key: OPENNLP-1169
> URL: https://issues.apache.org/jira/browse/OPENNLP-1169
> Project: OpenNLP
>  Issue Type: Bug
>  Components: word vectors
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 1.8.4
>
>
> The {{WordVectorsTable}} API retrieves a {{WordVector}} via {{CharSequence}}. This 
> is suboptimal, as implementors could store such WVs in a hash table (e.g. 
> {{MapWordVectorsTable}}) and the value of {{CharSequence#toString}} is not 
> guaranteed to be stable.
> Additionally, it is more common to have words as Strings rather than 
> CharSequences, which is also more consistent with other OpenNLP APIs (e.g. 
> {{Tokenizer}}).
> So {{WordVectorsTable}} should instead retrieve {{WordVector}}s using String.
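The following standalone sketch illustrates a closely related hazard of `CharSequence` keys in a hash table: the `CharSequence` interface does not define `equals`/`hashCode` by content, so two sequences with the same characters need not match. The class `CharSequenceKeyDemo` and its methods are hypothetical and use plain `float[]` values rather than OpenNLP's `WordVector`.

```java
import java.util.HashMap;
import java.util.Map;

class CharSequenceKeyDemo {

  /** Lookup in a table keyed by CharSequence: equality is whatever the key's class defines. */
  static boolean foundByCharSequence(Map<CharSequence, float[]> table, CharSequence token) {
    return table.containsKey(token);
  }

  /** Lookup in a table keyed by String: equality is always by content. */
  static boolean foundByString(Map<String, float[]> table, String token) {
    return table.containsKey(token);
  }

  public static void main(String[] args) {
    Map<CharSequence, float[]> bySequence = new HashMap<>();
    bySequence.put("cat", new float[] {0.1f, 0.2f});
    // StringBuilder inherits identity-based equals/hashCode, so the lookup misses.
    System.out.println(foundByCharSequence(bySequence, new StringBuilder("cat"))); // false

    Map<String, float[]> byString = new HashMap<>();
    byString.put("cat", new float[] {0.1f, 0.2f});
    // Converting to String first gives content-based equality.
    System.out.println(foundByString(byString, new StringBuilder("cat").toString())); // true
  }
}
```

Requiring `String` in the API moves the conversion to the caller, where it happens once and visibly, instead of leaving each table implementation to guess how to compare keys.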





[jira] [Closed] (OPENNLP-1167) WordVector toArray methods should be removed

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick closed OPENNLP-1167.
--

> WordVector toArray methods should be removed
> 
>
> Key: OPENNLP-1167
> URL: https://issues.apache.org/jira/browse/OPENNLP-1167
> Project: OpenNLP
>  Issue Type: Task
>  Components: word vectors
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 1.8.4
>
>
> {{WordVector#toDoubleArray}} and {{WordVector#toFloatArray}} always require a 
> copy and have a size limitation, and therefore should probably be removed.
> Additionally, we should consider whether it makes sense to keep 
> {{FloatArrayVector#toDoubleBuffer}} and {{DoubleArrayVector#toFloatBuffer}}, 
> which also require a copy. The alternative is to throw an 
> {{UnsupportedOperationException}} in such cases.
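Why the cross-type conversions need a copy can be shown with plain `java.nio` buffers. This is a sketch under the assumption that a vector is backed by a `float[]`; `VectorViews` and its methods are hypothetical names, not the OpenNLP classes mentioned in the issue.

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

class VectorViews {

  /** Zero-copy: the returned buffer is a read-only view over the same float[] storage. */
  static FloatBuffer asFloatBuffer(float[] vector) {
    return FloatBuffer.wrap(vector).asReadOnlyBuffer();
  }

  /** Float storage cannot be viewed as doubles, so each element is widened into a new array. */
  static DoubleBuffer copyToDoubleBuffer(float[] vector) {
    double[] copy = new double[vector.length];
    for (int i = 0; i < vector.length; i++) {
      copy[i] = vector[i];
    }
    return DoubleBuffer.wrap(copy);
  }

  public static void main(String[] args) {
    float[] storage = {0.5f, 2.0f};
    FloatBuffer view = asFloatBuffer(storage);
    DoubleBuffer copy = copyToDoubleBuffer(storage);

    storage[0] = 4.0f; // mutate the backing array after both buffers exist

    System.out.println("view sees the change: " + view.get(0));
    System.out.println("copy keeps the old value: " + copy.get(0));
  }
}
```

The same-type buffer costs nothing and tracks the underlying data, whereas the widened buffer allocates once per call and goes stale, which is the cost the issue weighs against throwing `UnsupportedOperationException`.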





[jira] [Closed] (OPENNLP-1168) Resolved concurrency issue in POS tagger.

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick closed OPENNLP-1168.
--

> Resolved concurrency issue in POS tagger.
> -
>
> Key: OPENNLP-1168
> URL: https://issues.apache.org/jira/browse/OPENNLP-1168
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: POS Tagger
>Affects Versions: 1.8.4
>Reporter: Niels Schuette
>  Labels: easyfix, patch
> Fix For: 1.8.4
>
>
> We encountered a concurrency issue in the pos tagger module in the class 
> DefaultPOSContextGenerator.
> The issue is demonstrated in DefaultPOSContextGeneratorTest.java. The test 
> "multithreading()" consistently fails on our system with the current code if 
> the number of threads (NUMBER_OF_THREADS) is set to 10. If the number of 
> threads is set to 1 (effectively disabling multithreading), the test 
> consistently passes.
> We resolved the issue by removing a field in DefaultPOSContextGenerator.java.
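The kind of race described above, and the shape of the fix, can be sketched without OpenNLP. `DictLookup` is a hypothetical stand-in that uses a plain `Set<String>` instead of `Dictionary` and `StringList`; only the pattern (a shared scratch field replaced by a local value) is taken from the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DictLookup {

  private final Set<String> dict;

  // Racy variant: one scratch array shared by every caller of this instance.
  private final String[] sharedGram = new String[1];

  DictLookup(Set<String> dict) {
    this.dict = dict;
  }

  /** Unsafe under concurrency: another thread may overwrite sharedGram[0] between write and read. */
  boolean containsRacy(String word) {
    sharedGram[0] = word;
    return dict.contains(sharedGram[0]);
  }

  /** Safe: the word lives only in the caller's stack frame, and the set is never modified. */
  boolean containsSafe(String word) {
    return dict.contains(word);
  }

  /** Calls containsSafe from several threads and returns the number of wrong answers. */
  static int countErrors(int threads, int iterations) {
    DictLookup lookup = new DictLookup(new HashSet<>(Arrays.asList("in", "dict")));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final boolean known = t % 2 == 0; // half the threads ask for a word that is present
        Callable<Integer> task = () -> {
          int errors = 0;
          for (int i = 0; i < iterations; i++) {
            if (lookup.containsSafe(known ? "dict" : "missing") != known) {
              errors++;
            }
          }
          return errors;
        };
        results.add(pool.submit(task));
      }
      int total = 0;
      for (Future<Integer> f : results) {
        total += f.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("errors with the safe lookup: " + countErrors(10, 100_000));
  }
}
```

Swapping `containsSafe` for `containsRacy` inside `countErrors` may produce a non-zero count, but a race is timing-dependent and is not guaranteed to show up on every run or machine.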





[jira] [Resolved] (OPENNLP-1168) Resolved concurrency issue in POS tagger.

2017-12-20 Thread Jeff Zemerick (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zemerick resolved OPENNLP-1168.

Resolution: Fixed



[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298247#comment-16298247
 ] 

ASF GitHub Bot commented on OPENNLP-1168:
-

kottmann closed pull request #296: OPENNLP-1168: Resolved concurrency issue in 
POS tagger.
URL: https://github.com/apache/opennlp/pull/296
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java
 
b/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java
index 3035ca523..3f4fe97ed 100644
--- 
a/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java
+++ 
b/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java
@@ -43,7 +43,6 @@
   private Object wordsKey;
 
   private Dictionary dict;
-  private String[] dictGram;
 
   /**
* Initializes the current instance.
@@ -62,7 +61,7 @@ public DefaultPOSContextGenerator(Dictionary dict) {
*/
   public DefaultPOSContextGenerator(int cacheSize, Dictionary dict) {
 this.dict = dict;
-dictGram = new String[1];
+
 if (cacheSize > 0) {
   contextsCache = new Cache<>(cacheSize);
 }
@@ -148,8 +147,8 @@ public DefaultPOSContextGenerator(int cacheSize, Dictionary 
dict) {
 e.add("default");
 // add the word itself
 e.add("w=" + lex);
-dictGram[0] = lex;
-if (dict == null || !dict.contains(new StringList(dictGram))) {
+
+if (dict == null || !dict.contains(new StringList(lex))) {
   // do some basic suffix analysis
   String[] suffs = getSuffixes(lex);
   for (int i = 0; i < suffs.length; i++) {
diff --git a/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java 
b/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java
index 1edcf4b5b..4801bbf40 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java
@@ -222,8 +222,8 @@ public void probs(double[] probs) {
   }
 
   public static POSModel train(String languageCode,
-  ObjectStream samples, TrainingParameters trainParams,
-  POSTaggerFactory posFactory) throws IOException {
+   ObjectStream samples, 
TrainingParameters trainParams,
+   POSTaggerFactory posFactory) throws IOException 
{
 
 int beamSize = trainParams.getIntParameter(BeamSearch.BEAM_SIZE_PARAMETER, 
POSTaggerME.DEFAULT_BEAM_SIZE);
 
@@ -288,7 +288,7 @@ public static Dictionary 
buildNGramDictionary(ObjectStream samples, i
   }
 
   public static void populatePOSDictionary(ObjectStream samples,
-  MutableTagDictionary dict, int cutoff) throws IOException {
+   MutableTagDictionary dict, int 
cutoff) throws IOException {
 System.out.println("Expanding POS Dictionary ...");
 long start = System.nanoTime();
 
diff --git 
a/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java
 
b/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java
new file mode 100644
index 0..450bb2cc3
--- /dev/null
+++ 
b/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.postag;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import opennlp.tools.dictionary.Dictionary;
+import opennlp.tools.util.StringList;
+
+/**
+ *
+ * We 

[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298246#comment-16298246
 ] 

ASF GitHub Bot commented on OPENNLP-1168:
-

kottmann commented on issue #296: OPENNLP-1168: Resolved concurrency issue in 
POS tagger.
URL: https://github.com/apache/opennlp/pull/296#issuecomment-353025713
 
 
   Merged.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298210#comment-16298210
 ] 

ASF GitHub Bot commented on OPENNLP-1168:
-

kottmann commented on a change in pull request #296: OPENNLP-1168: Resolved 
concurrency issue in POS tagger.
URL: https://github.com/apache/opennlp/pull/296#discussion_r157981899
 
 

 ##
 File path: opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java
 ##
 @@ -222,8 +222,8 @@ public void probs(double[] probs) {
   }
 
   public static POSModel train(String languageCode,
-  ObjectStream samples, TrainingParameters trainParams,
-  POSTaggerFactory posFactory) throws IOException {
+   ObjectStream samples, 
TrainingParameters trainParams,
 
 Review comment:
   These minor formatting changes should be removed.






[jira] [Commented] (OPENNLP-1130) Sentence detector format support for NKJP

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298201#comment-16298201
 ] 

ASF GitHub Bot commented on OPENNLP-1130:
-

kottmann closed pull request #263: OPENNLP-1130 Sentence detector format 
support for NKJP
URL: https://github.com/apache/opennlp/pull/263
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java 
b/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java
index 48b80256f..cd2b4dc69 100644
--- 
a/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java
+++ 
b/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java
@@ -61,6 +61,7 @@
 import opennlp.tools.formats.letsmt.LetsmtSentenceStreamFactory;
 import opennlp.tools.formats.moses.MosesSentenceSampleStreamFactory;
 import opennlp.tools.formats.muc.Muc6NameSampleStreamFactory;
+import opennlp.tools.formats.nkjp.NKJPSentenceSampleStreamFactory;
 import opennlp.tools.formats.ontonotes.OntoNotesNameSampleStreamFactory;
 import opennlp.tools.formats.ontonotes.OntoNotesPOSSampleStreamFactory;
 import opennlp.tools.formats.ontonotes.OntoNotesParseSampleStreamFactory;
@@ -128,6 +129,7 @@
 IrishSentenceBankSentenceStreamFactory.registerFactory();
 IrishSentenceBankTokenSampleStreamFactory.registerFactory();
 LeipzigLanguageSampleStreamFactory.registerFactory();
+NKJPSentenceSampleStreamFactory.registerFactory();
   }
 
   public static final String DEFAULT_FORMAT = "opennlp";
diff --git 
a/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java
 
b/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java
new file mode 100644
index 0..b532bd941
--- /dev/null
+++ 
b/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.formats.nkjp;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.xpath.XPath;
+import javax.xml.xpath.XPathConstants;
+import javax.xml.xpath.XPathExpression;
+import javax.xml.xpath.XPathExpressionException;
+import javax.xml.xpath.XPathFactory;
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.SAXException;
+
+import opennlp.tools.util.Span;
+import opennlp.tools.util.XmlUtil;
+
+public class NKJPSegmentationDocument {
+
+  public static class Pointer {
+String doc;
+String id;
+int offset;
+int length;
+boolean space_after;
+
+public Pointer(String doc, String id, int offset, int length, boolean 
space_after) {
+  this.doc = doc;
+  this.id = id;
+  this.offset = offset;
+  this.length = length;
+  this.space_after = space_after;
+}
+
+public Span toSpan() {
+  return new Span(this.offset, this.offset + this.length);
+}
+
+@Override
+public String toString() {
+  return doc + "#string-range(" + id + "," + Integer.toString(offset)
+  + "," + Integer.toString(length) + ")";
+}
+  }
+
+  public Map> getSegments() {
+return segments;
+  }
+
+  Map> segments;
+
+  NKJPSegmentationDocument() {
+this.segments = new LinkedHashMap<>();
+  }
+
+  NKJPSegmentationDocument(Map> segments) {
+this();
+this.segments = segments;
+  }
+
+  public static NKJPSegmentationDocument parse(InputStream is) throws 
IOException {
+
+Map> sentences = new LinkedHashMap<>();
+
+try {
+  DocumentBuilder docBuilder = XmlUtil.createDocumentBuilder();;
+  

[jira] [Commented] (OPENNLP-1166) TwoPassDataIndexer fails if features contain \n

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298140#comment-16298140
 ] 

ASF GitHub Bot commented on OPENNLP-1166:
-

kottmann closed pull request #294: OPENNLP-1166: TwoPassDataIndexer fails if 
features contain \n
URL: https://github.com/apache/opennlp/pull/294
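Why a newline inside a feature is harmless once events are stored in a length-prefixed binary form, rather than one event per text line, can be sketched as follows. The field order mirrors the protocol comment in this pull request's diff (outcome, context count, context strings, value count, values); `BinaryEventCodec` and its methods are hypothetical and work on byte arrays, not on OpenNLP's `Event` type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BinaryEventCodec {

  /** Writes outcome, context count, context strings, value count, values (null values count as 0). */
  static byte[] encode(String outcome, String[] context, float[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(outcome);
      out.writeInt(context.length);
      for (String c : context) {
        out.writeUTF(c); // length-prefixed, so embedded '\n' is just another character
      }
      int valueCount = values == null ? 0 : values.length;
      out.writeInt(valueCount);
      for (int i = 0; i < valueCount; i++) {
        out.writeFloat(values[i]);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads one encoded event and returns its context strings. */
  static String[] decodeContext(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      in.readUTF(); // skip the outcome
      String[] context = new String[in.readInt()];
      for (int i = 0; i < context.length; i++) {
        context[i] = in.readUTF();
      }
      return context;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = encode("NOUN", new String[] {"w=line\nbreak", "default"}, null);
    String[] context = decodeContext(data);
    System.out.println(context.length + " context strings, newline kept: "
        + context[0].equals("w=line\nbreak"));
  }
}
```

One limit of this encoding worth knowing: `DataOutputStream.writeUTF` rejects strings whose encoded form exceeds 65535 bytes, so very long features would need a different framing.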
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java 
b/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
index 5e347e886..4121e36c1 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
@@ -17,13 +17,16 @@
 
 package opennlp.tools.ml.model;
 
-import java.io.BufferedWriter;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
 import java.io.File;
+import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.IOException;
-import java.io.OutputStreamWriter;
-import java.io.Writer;
-import java.nio.charset.StandardCharsets;
+import java.math.BigInteger;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -59,20 +62,28 @@ public void index(ObjectStream eventStream) throws 
IOException {
 File tmp = File.createTempFile("events", null);
 tmp.deleteOnExit();
 int numEvents;
-try (Writer osw = new BufferedWriter(new OutputStreamWriter(new 
FileOutputStream(tmp),
-StandardCharsets.UTF_8))) {
-  numEvents = computeEventCounts(eventStream, osw, predicateIndex, cutoff);
+BigInteger writeHash;
+HashSumEventStream writeEventStream = new HashSumEventStream(eventStream); 
 // do not close.
+try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) {
+  numEvents = computeEventCounts(writeEventStream, dos, predicateIndex, 
cutoff);
 }
+writeHash = writeEventStream.calculateHashSum();
+
 display("done. " + numEvents + " events\n");
 
 display("\tIndexing...  ");
 
 List eventsToCompare;
-try (FileEventStream fes = new FileEventStream(tmp)) {
-  eventsToCompare = index(fes, predicateIndex);
+BigInteger readHash = null;
+try (HashSumEventStream readStream = new HashSumEventStream(new 
EventStream(tmp))) {
+  eventsToCompare = index(readStream, predicateIndex);
+  readHash = readStream.calculateHashSum();
 }
-
 tmp.delete();
+
+if (readHash.compareTo(writeHash) != 0)
+  throw new IOException("Event hash for writing and reading events did not 
match.");
+
 display("done.\n");
 
 if (sort) {
@@ -91,12 +102,19 @@ public void index(ObjectStream eventStream) throws 
IOException {
* occur at least cutoff times are added to the
* predicatesInOut map along with a unique integer index.
*
+   * Protocol:
+   *  1 - (utf string) - Event outcome
+   *  2 - (int) - Event context array length
+   *  3+ - (utf string) - Event context string
+   *  4 - (int) - Event values array length
+   *  5+ - (float) - Event value
+   *
* @param eventStream an EventStream value
* @param eventStore a writer to which the events are written to for later 
processing.
* @param predicatesInOut a TObjectIntHashMap value
* @param cutoff an int value
*/
-  private int computeEventCounts(ObjectStream eventStream, Writer 
eventStore,
+  private int computeEventCounts(ObjectStream eventStream, 
DataOutputStream eventStore,
   Map predicatesInOut, int cutoff) throws IOException {
 Map counter = new HashMap<>();
 int eventCount = 0;
@@ -104,9 +122,23 @@ private int computeEventCounts(ObjectStream 
eventStream, Writer eventStor
 Event ev;
 while ((ev = eventStream.read()) != null) {
   eventCount++;
-  eventStore.write(FileEventStream.toLine(ev));
+
+  eventStore.writeUTF(ev.getOutcome());
+
+  eventStore.writeInt(ev.getContext().length);
   String[] ec = ev.getContext();
   update(ec, counter);
+  for (String ctxString : ec)
+eventStore.writeUTF(ctxString);
+
+  if (ev.getValues() == null) {
+eventStore.writeInt(0);
+  }
+  else {
+eventStore.writeInt(ev.getValues().length);
+for (float value : ev.getValues())
+  eventStore.writeFloat(value);
+  }
 }
 
 String[] predicateSet = counter.entrySet().stream()
@@ -122,4 +154,45 @@ private int computeEventCounts(ObjectStream 
eventStream, Writer eventStor
 
 return eventCount;
   }
+
+  private static class EventStream