[jira] [Commented] (OPENNLP-1173) Prepare the 1.8.4 release
[ https://issues.apache.org/jira/browse/OPENNLP-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299049#comment-16299049 ] Jeff Zemerick commented on OPENNLP-1173: Will base off commit c2f1b685abecfc11de76ffd0a28771f41b566782 > Prepare the 1.8.4 release > - > > Key: OPENNLP-1173 > URL: https://issues.apache.org/jira/browse/OPENNLP-1173 > Project: OpenNLP > Issue Type: Task >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick > > This is a task to track the 1.8.4 release of OpenNLP. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (OPENNLP-1173) Prepare the 1.8.4 release
[ https://issues.apache.org/jira/browse/OPENNLP-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-1173: --- Description: This is a task to track the 1.8.4 release of OpenNLP. (was: Prepare the 1.8.4 release.) > Prepare the 1.8.4 release > - > > Key: OPENNLP-1173 > URL: https://issues.apache.org/jira/browse/OPENNLP-1173 > Project: OpenNLP > Issue Type: Task >Reporter: Jeff Zemerick >Assignee: Jeff Zemerick > > This is a task to track the 1.8.4 release of OpenNLP.
[jira] [Updated] (OPENNLP-1082) SentenceSampleStream should add EOS to samples if missing
[ https://issues.apache.org/jira/browse/OPENNLP-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-1082: --- Fix Version/s: (was: 1.8.4) 1.8.5 > SentenceSampleStream should add EOS to samples if missing > - > > Key: OPENNLP-1082 > URL: https://issues.apache.org/jira/browse/OPENNLP-1082 > Project: OpenNLP > Issue Type: Improvement > Components: Sentence Detector >Reporter: William Colen >Assignee: William Colen > Fix For: 1.8.5
[jira] [Updated] (OPENNLP-936) Add thread safe versions of some tools (ME sentence detection, tokenization, pos tagging)
[ https://issues.apache.org/jira/browse/OPENNLP-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-936: -- Fix Version/s: (was: 1.8.4) 1.8.5 > Add thread safe versions of some tools (ME sentence detection, tokenization, > pos tagging) > - > > Key: OPENNLP-936 > URL: https://issues.apache.org/jira/browse/OPENNLP-936 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 1.7.1 >Reporter: Thilo Goetz >Priority: Minor > Fix For: 1.8.5 > > > As discussed on the mailing list, add thread-safe versions of maximum entropy > sentence detection, tokenization and POS tagging.
[jira] [Updated] (OPENNLP-1150) TokenNameFinderTrainerTool should use ModelUtil.createDefaultTrainingParameters() when mlParams is null
[ https://issues.apache.org/jira/browse/OPENNLP-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-1150: --- Fix Version/s: (was: 1.8.4) 1.8.5 > TokenNameFinderTrainerTool should use > ModelUtil.createDefaultTrainingParameters() when mlParams is null > --- > > Key: OPENNLP-1150 > URL: https://issues.apache.org/jira/browse/OPENNLP-1150 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.5 > > > Unlike the other TrainerTools, TokenNameFinderTrainerTool creates an empty > TrainingParameters by calling the constructor when mlParams is null. > TokenNameFinderTrainerTool should use > ModelUtil.createDefaultTrainingParameters() to initialize mlParams, as the > other TrainerTools do.
[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial
[ https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-47: - Fix Version/s: (was: 1.8.4) 1.8.5 > Rewrite the CONLL06 documentation based on the tutorial > --- > > Key: OPENNLP-47 > URL: https://issues.apache.org/jira/browse/OPENNLP-47 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Affects Versions: tools-1.5.1-incubating >Reporter: Joern Kottmann > Labels: help-wanted > Fix For: 1.8.5 > > > The CONLL06 documentation should be rewritten to reflect the new converters > which have been added to OpenNLP since it was first written.
[jira] [Updated] (OPENNLP-1140) Add 20 newsgroups format support
[ https://issues.apache.org/jira/browse/OPENNLP-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-1140: --- Fix Version/s: (was: 1.8.4) 1.8.5 > Add 20 newsgroups format support > > > Key: OPENNLP-1140 > URL: https://issues.apache.org/jira/browse/OPENNLP-1140 > Project: OpenNLP > Issue Type: Improvement > Components: Formats >Reporter: Tommaso Teofili > Fix For: 1.8.5 > > > It'd be nice to have support for [20 > newsgroups|http://qwone.com/~jason/20Newsgroups/] format, especially for > evaluating {{DocCat}} models.
[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial
[ https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-47: - Fix Version/s: (was: 1.8.4) > Rewrite the CONLL06 documentation based on the tutorial > --- > > Key: OPENNLP-47 > URL: https://issues.apache.org/jira/browse/OPENNLP-47 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Affects Versions: tools-1.5.1-incubating >Reporter: Joern Kottmann > Labels: help-wanted > Fix For: 1.8.4 > > > The CONLL06 documentation should be rewritten to reflect the new converters > which have been added to OpenNLP since it was first written.
[jira] [Updated] (OPENNLP-47) Rewrite the CONLL06 documentation based on the tutorial
[ https://issues.apache.org/jira/browse/OPENNLP-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick updated OPENNLP-47: - Fix Version/s: 1.8.4 > Rewrite the CONLL06 documentation based on the tutorial > --- > > Key: OPENNLP-47 > URL: https://issues.apache.org/jira/browse/OPENNLP-47 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Affects Versions: tools-1.5.1-incubating >Reporter: Joern Kottmann > Labels: help-wanted > Fix For: 1.8.4 > > > The CONLL06 documentation should be rewritten to reflect the new converters > which have been added to OpenNLP since it was first written.
[jira] [Closed] (OPENNLP-1171) some tests create temp files and directories but never delete them
[ https://issues.apache.org/jira/browse/OPENNLP-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick closed OPENNLP-1171. -- > some tests create temp files and directories but never delete them > -- > > Key: OPENNLP-1171 > URL: https://issues.apache.org/jira/browse/OPENNLP-1171 > Project: OpenNLP > Issue Type: Bug > Components: Build, Packaging and Test >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 1.8.4 > > > Some temporary files and directories that are created in some tests are never > deleted and the number of temporary files/directories is increasing after > running mvn clean test.
[jira] [Closed] (OPENNLP-1169) WordVectorTable should reference WVs by String
[ https://issues.apache.org/jira/browse/OPENNLP-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick closed OPENNLP-1169. -- > WordVectorTable should reference WVs by String > -- > > Key: OPENNLP-1169 > URL: https://issues.apache.org/jira/browse/OPENNLP-1169 > Project: OpenNLP > Issue Type: Bug > Components: word vectors >Reporter: Tommaso Teofili >Assignee: Tommaso Teofili > Fix For: 1.8.4 > > > The {{WordVectorsTable}} API retrieves a {{WordVector}} via {{CharSequence}}. This > is suboptimal, as implementors could store such WVs in a hash table (e.g. > {{MapWordVectorsTable}}) and the value of {{CharSequence#toString}} is not > guaranteed to be stable. > Additionally, it is more common to have words as Strings rather than > CharSequences, which is more consistent with other OpenNLP APIs (e.g. > {{Tokenizer}}). > So {{WordVectorsTable}} should instead retrieve {{WordVector}}s using String.
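Editor's note on OPENNLP-1169: the hash table concern in the description above can be illustrated with a minimal, hedged sketch. The class and method names below are hypothetical and are not part of OpenNLP; the point is only that a map keyed by CharSequence relies on each implementation's own equals/hashCode, and a StringBuilder holding the same characters as a String is not equal to it, while a map keyed by String compares content.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not OpenNLP code.
class CharSequenceKeyDemo {

  // Stand-in for a map-backed table keyed by CharSequence.
  static boolean lookupByCharSequence(CharSequence stored, CharSequence query) {
    Map<CharSequence, float[]> table = new HashMap<>();
    table.put(stored, new float[] {0.1f, 0.2f});
    // The lookup uses the key object's own equals/hashCode.
    return table.containsKey(query);
  }

  // Same table, but keys are normalized to String on both sides.
  static boolean lookupByString(CharSequence stored, CharSequence query) {
    Map<String, float[]> table = new HashMap<>();
    table.put(stored.toString(), new float[] {0.1f, 0.2f});
    return table.containsKey(query.toString());
  }

  public static void main(String[] args) {
    CharSequence stored = "house";
    CharSequence query = new StringBuilder("house");
    System.out.println(lookupByCharSequence(stored, query)); // prints false
    System.out.println(lookupByString(stored, query));       // prints true
  }
}
```

Keying by String moves the normalization to the API boundary, which matches the direction the issue describes.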
[jira] [Closed] (OPENNLP-1167) WordVector toArray methods should be removed
[ https://issues.apache.org/jira/browse/OPENNLP-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick closed OPENNLP-1167. -- > WordVector toArray methods should be removed > > > Key: OPENNLP-1167 > URL: https://issues.apache.org/jira/browse/OPENNLP-1167 > Project: OpenNLP > Issue Type: Task > Components: word vectors >Reporter: Tommaso Teofili >Assignee: Tommaso Teofili > Fix For: 1.8.4 > > > {{WordVector#toDoubleArray}} and {{WordVector#toFloatArray}} always require a > copy, have a size limitation, and should therefore probably be removed. > Additionally, we should consider whether it makes sense to keep > {{FloatArrayVector#toDoubleBuffer}} and {{DoubleArrayVector#toFloatBuffer}}, > which also require a copy. The alternative is to throw an > {{UnsupportedOperationException}} in such cases.
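Editor's note on OPENNLP-1167: a small sketch of why the array conversions "always require a copy". The helper names are hypothetical and only use the JDK; they are not the OpenNLP WordVector API. Widening a float[] to a double[] has to allocate a new array, whereas FloatBuffer.wrap returns a view over the same storage.

```java
import java.nio.FloatBuffer;

// Hypothetical demo class, not OpenNLP code.
class VectorCopyDemo {

  // Widening float values to double always allocates and copies.
  static double[] toDoubleArray(float[] values) {
    double[] copy = new double[values.length];
    for (int i = 0; i < values.length; i++) {
      copy[i] = values[i];
    }
    return copy;
  }

  // FloatBuffer.wrap shares the backing array, so no copy is made.
  static FloatBuffer toFloatBuffer(float[] values) {
    return FloatBuffer.wrap(values);
  }

  public static void main(String[] args) {
    float[] vector = {1f, 2f};
    double[] asDoubles = toDoubleArray(vector);
    FloatBuffer asBuffer = toFloatBuffer(vector);
    vector[0] = 9f;                          // mutate the original storage
    System.out.println(asDoubles[0]);        // prints 1.0 (independent copy)
    System.out.println(asBuffer.get(0));     // prints 9.0 (shared view)
  }
}
```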
[jira] [Closed] (OPENNLP-1168) Resolved concurrency issue in POS tagger.
[ https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick closed OPENNLP-1168. -- > Resolved concurrency issue in POS tagger. > - > > Key: OPENNLP-1168 > URL: https://issues.apache.org/jira/browse/OPENNLP-1168 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 1.8.4 >Reporter: Niels Schuette > Labels: easyfix, patch > Fix For: 1.8.4 > > > We encountered a concurrency issue in the pos tagger module in the class > DefaultPOSContextGenerator. > The issue is demonstrated in DefaultPOSContextGeneratorTest.java. The test > "multithreading()" consistently fails on our system with the current code if > the number of threads (NUMBER_OF_THREADS) is set to 10. If the number of > threads is set to 1 (effectively disabling multithreading), the test > consistently passes. > We resolved the issue by removing a field in DefaultPOSContextGenerator.java.
[jira] [Resolved] (OPENNLP-1168) Resolved concurrency issue in POS tagger.
[ https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zemerick resolved OPENNLP-1168. Resolution: Fixed > Resolved concurrency issue in POS tagger. > - > > Key: OPENNLP-1168 > URL: https://issues.apache.org/jira/browse/OPENNLP-1168 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 1.8.4 >Reporter: Niels Schuette > Labels: easyfix, patch > Fix For: 1.8.4 > > > We encountered a concurrency issue in the pos tagger module in the class > DefaultPOSContextGenerator. > The issue is demonstrated in DefaultPOSContextGeneratorTest.java. The test > "multithreading()" consistently fails on our system with the current code if > the number of threads (NUMBER_OF_THREADS) is set to 10. If the number of > threads is set to 1 (effectively disabling multithreading), the test > consistently passes. > We resolved the issue by removing a field in DefaultPOSContextGenerator.java.
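Editor's note on OPENNLP-1168: the description says the fix was to remove a field from DefaultPOSContextGenerator. The sketch below is a simplified illustration of that general pattern and is not the OpenNLP code: the classes SharedScratch and NoScratch and the key() method are hypothetical. A one-element scratch array kept on the instance can be overwritten by another thread between the write and the read, while a version that keeps no per-call state on the instance cannot be affected this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not OpenNLP code.
class ScratchFieldDemo {

  interface KeyGenerator {
    String key(String word);
  }

  // Unsafe pattern: every caller writes to and reads from the same slot.
  static class SharedScratch implements KeyGenerator {
    private final String[] gram = new String[1];

    @Override
    public String key(String word) {
      gram[0] = word;         // another thread may overwrite this slot ...
      return "w=" + gram[0];  // ... before it is read back here
    }
  }

  // Safe pattern: the instance holds no per-call state.
  static class NoScratch implements KeyGenerator {
    @Override
    public String key(String word) {
      return "w=" + word;
    }
  }

  // Calls key() from several threads and counts answers that do not match the input.
  static int countMismatches(KeyGenerator gen, int threads, int callsPerThread) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final String word = "word" + t;
        results.add(pool.submit(() -> {
          int bad = 0;
          for (int i = 0; i < callsPerThread; i++) {
            if (!gen.key(word).equals("w=" + word)) {
              bad++;
            }
          }
          return bad;
        }));
      }
      int total = 0;
      for (Future<Integer> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // The shared-scratch count depends on timing and may differ between runs.
    System.out.println("shared scratch mismatches: "
        + countMismatches(new SharedScratch(), 10, 100_000));
    System.out.println("no scratch mismatches: "
        + countMismatches(new NoScratch(), 10, 100_000));
  }
}
```

With a single thread both variants behave identically, which is consistent with the report that the test passes when NUMBER_OF_THREADS is 1.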
[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.
[ https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298247#comment-16298247 ] ASF GitHub Bot commented on OPENNLP-1168: - kottmann closed pull request #296: OPENNLP-1168: Resolved concurrency issue in POS tagger. URL: https://github.com/apache/opennlp/pull/296 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java b/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java index 3035ca523..3f4fe97ed 100644 --- a/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java +++ b/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java @@ -43,7 +43,6 @@ private Object wordsKey; private Dictionary dict; - private String[] dictGram; /** * Initializes the current instance. 
@@ -62,7 +61,7 @@ public DefaultPOSContextGenerator(Dictionary dict) { */ public DefaultPOSContextGenerator(int cacheSize, Dictionary dict) { this.dict = dict; -dictGram = new String[1]; + if (cacheSize > 0) { contextsCache = new Cache<>(cacheSize); } @@ -148,8 +147,8 @@ public DefaultPOSContextGenerator(int cacheSize, Dictionary dict) { e.add("default"); // add the word itself e.add("w=" + lex); -dictGram[0] = lex; -if (dict == null || !dict.contains(new StringList(dictGram))) { + +if (dict == null || !dict.contains(new StringList(lex))) { // do some basic suffix analysis String[] suffs = getSuffixes(lex); for (int i = 0; i < suffs.length; i++) { diff --git a/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java b/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java index 1edcf4b5b..4801bbf40 100644 --- a/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java +++ b/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java @@ -222,8 +222,8 @@ public void probs(double[] probs) { } public static POSModel train(String languageCode, - ObjectStream samples, TrainingParameters trainParams, - POSTaggerFactory posFactory) throws IOException { + ObjectStream samples, TrainingParameters trainParams, + POSTaggerFactory posFactory) throws IOException { int beamSize = trainParams.getIntParameter(BeamSearch.BEAM_SIZE_PARAMETER, POSTaggerME.DEFAULT_BEAM_SIZE); @@ -288,7 +288,7 @@ public static Dictionary buildNGramDictionary(ObjectStream samples, i } public static void populatePOSDictionary(ObjectStream samples, - MutableTagDictionary dict, int cutoff) throws IOException { + MutableTagDictionary dict, int cutoff) throws IOException { System.out.println("Expanding POS Dictionary ..."); long start = System.nanoTime(); diff --git a/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java b/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java new file mode 100644 index 0..450bb2cc3 
--- /dev/null +++ b/opennlp-tools/src/test/java/opennlp/tools/postag/DefaultPOSContextGeneratorTest.java @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.postag; + +import java.util.Arrays; +import java.util.List; +import java.util.concurrent.Callable; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import org.junit.Assert; +import org.junit.BeforeClass; +import org.junit.Test; + +import opennlp.tools.dictionary.Dictionary; +import opennlp.tools.util.StringList; + +/** + * + * We
[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.
[ https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298246#comment-16298246 ] ASF GitHub Bot commented on OPENNLP-1168: - kottmann commented on issue #296: OPENNLP-1168: Resolved concurrency issue in POS tagger. URL: https://github.com/apache/opennlp/pull/296#issuecomment-353025713 Merged. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Resolved concurrency issue in POS tagger. > - > > Key: OPENNLP-1168 > URL: https://issues.apache.org/jira/browse/OPENNLP-1168 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 1.8.4 >Reporter: Niels Schuette > Labels: easyfix, patch > Fix For: 1.8.4 > > > We encountered a concurrency issue in the pos tagger module in the class > DefaultPOSContextGenerator. > The issue is demonstrated in DefaultPOSContextGeneratorTest.java. The test > "multithreading()" consistently fails on our system with the current code if > the number of threads (NUMBER_OF_THREADS) is set to 10. If the number of > threads is set to 1 (effectively disabling multithreading), the test > consistently passes. > We resolved the issue by removing a field in DefaultPOSContextGenerator.java.
[jira] [Commented] (OPENNLP-1168) Resolved concurrency issue in POS tagger.
[ https://issues.apache.org/jira/browse/OPENNLP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298210#comment-16298210 ] ASF GitHub Bot commented on OPENNLP-1168: - kottmann commented on a change in pull request #296: OPENNLP-1168: Resolved concurrency issue in POS tagger. URL: https://github.com/apache/opennlp/pull/296#discussion_r157981899 ## File path: opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerME.java ## @@ -222,8 +222,8 @@ public void probs(double[] probs) { } public static POSModel train(String languageCode, - ObjectStream samples, TrainingParameters trainParams, - POSTaggerFactory posFactory) throws IOException { + ObjectStream samples, TrainingParameters trainParams, Review comment: These minor formatting changes should be removed. > Resolved concurrency issue in POS tagger. > - > > Key: OPENNLP-1168 > URL: https://issues.apache.org/jira/browse/OPENNLP-1168 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 1.8.4 >Reporter: Niels Schuette > Labels: easyfix, patch > Fix For: 1.8.4 > > > We encountered a concurrency issue in the pos tagger module in the class > DefaultPOSContextGenerator. > The issue is demonstrated in DefaultPOSContextGeneratorTest.java. The test > "multithreading()" consistently fails on our system with the current code if > the number of threads (NUMBER_OF_THREADS) is set to 10. If the number of > threads is set to 1 (effectively disabling multithreading), the test > consistently passes. > We resolved the issue by removing a field in DefaultPOSContextGenerator.java.
[jira] [Commented] (OPENNLP-1130) Sentence detector format support for NKJP
[ https://issues.apache.org/jira/browse/OPENNLP-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298201#comment-16298201 ] ASF GitHub Bot commented on OPENNLP-1130: - kottmann closed pull request #263: OPENNLP-1130 Sentence detector format support for NKJP URL: https://github.com/apache/opennlp/pull/263 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java b/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java index 48b80256f..cd2b4dc69 100644 --- a/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java +++ b/opennlp-tools/src/main/java/opennlp/tools/cmdline/StreamFactoryRegistry.java @@ -61,6 +61,7 @@ import opennlp.tools.formats.letsmt.LetsmtSentenceStreamFactory; import opennlp.tools.formats.moses.MosesSentenceSampleStreamFactory; import opennlp.tools.formats.muc.Muc6NameSampleStreamFactory; +import opennlp.tools.formats.nkjp.NKJPSentenceSampleStreamFactory; import opennlp.tools.formats.ontonotes.OntoNotesNameSampleStreamFactory; import opennlp.tools.formats.ontonotes.OntoNotesPOSSampleStreamFactory; import opennlp.tools.formats.ontonotes.OntoNotesParseSampleStreamFactory; @@ -128,6 +129,7 @@ IrishSentenceBankSentenceStreamFactory.registerFactory(); IrishSentenceBankTokenSampleStreamFactory.registerFactory(); LeipzigLanguageSampleStreamFactory.registerFactory(); +NKJPSentenceSampleStreamFactory.registerFactory(); } public static final String DEFAULT_FORMAT = "opennlp"; diff --git a/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java b/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java new file mode 100644 index 0..b532bd941 --- 
/dev/null +++ b/opennlp-tools/src/main/java/opennlp/tools/formats/nkjp/NKJPSegmentationDocument.java @@ -0,0 +1,260 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.formats.nkjp; + +import java.io.File; +import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.util.LinkedHashMap; +import java.util.Map; +import javax.xml.parsers.DocumentBuilder; +import javax.xml.xpath.XPath; +import javax.xml.xpath.XPathConstants; +import javax.xml.xpath.XPathExpression; +import javax.xml.xpath.XPathExpressionException; +import javax.xml.xpath.XPathFactory; + +import org.w3c.dom.Document; +import org.w3c.dom.Node; +import org.w3c.dom.NodeList; +import org.xml.sax.SAXException; + +import opennlp.tools.util.Span; +import opennlp.tools.util.XmlUtil; + +public class NKJPSegmentationDocument { + + public static class Pointer { +String doc; +String id; +int offset; +int length; +boolean space_after; + +public Pointer(String doc, String id, int offset, int length, boolean space_after) { + this.doc = doc; + this.id = id; + this.offset = offset; + this.length = length; + this.space_after = space_after; +} + +public Span toSpan() { + return new Span(this.offset, 
this.offset + this.length); +} + +@Override +public String toString() { + return doc + "#string-range(" + id + "," + Integer.toString(offset) + + "," + Integer.toString(length) + ")"; +} + } + + public Map> getSegments() { +return segments; + } + + Map > segments; + + NKJPSegmentationDocument() { +this.segments = new LinkedHashMap<>(); + } + + NKJPSegmentationDocument(Map > segments) { +this(); +this.segments = segments; + } + + public static NKJPSegmentationDocument parse(InputStream is) throws IOException { + +Map > sentences = new LinkedHashMap<>(); + +try { + DocumentBuilder docBuilder = XmlUtil.createDocumentBuilder();; +
[jira] [Commented] (OPENNLP-1166) TwoPassDataIndexer fails if features contain \n
[ https://issues.apache.org/jira/browse/OPENNLP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298140#comment-16298140 ] ASF GitHub Bot commented on OPENNLP-1166: - kottmann closed pull request #294: OPENNLP-1166: TwoPassDataIndexer fails if features contain \n URL: https://github.com/apache/opennlp/pull/294 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java b/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java index 5e347e886..4121e36c1 100644 --- a/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java +++ b/opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java @@ -17,13 +17,16 @@ package opennlp.tools.ml.model; -import java.io.BufferedWriter; + +import java.io.BufferedInputStream; +import java.io.BufferedOutputStream; +import java.io.DataInputStream; +import java.io.DataOutputStream; import java.io.File; +import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.IOException; -import java.io.OutputStreamWriter; -import java.io.Writer; -import java.nio.charset.StandardCharsets; +import java.math.BigInteger; import java.util.HashMap; import java.util.List; import java.util.Map; @@ -59,20 +62,28 @@ public void index(ObjectStream eventStream) throws IOException { File tmp = File.createTempFile("events", null); tmp.deleteOnExit(); int numEvents; -try (Writer osw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(tmp), -StandardCharsets.UTF_8))) { - numEvents = computeEventCounts(eventStream, osw, predicateIndex, cutoff); +BigInteger writeHash; +HashSumEventStream writeEventStream = new HashSumEventStream(eventStream); // do not close. 
+try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) { + numEvents = computeEventCounts(writeEventStream, dos, predicateIndex, cutoff); } +writeHash = writeEventStream.calculateHashSum(); + display("done. " + numEvents + " events\n"); display("\tIndexing... "); List eventsToCompare; -try (FileEventStream fes = new FileEventStream(tmp)) { - eventsToCompare = index(fes, predicateIndex); +BigInteger readHash = null; +try (HashSumEventStream readStream = new HashSumEventStream(new EventStream(tmp))) { + eventsToCompare = index(readStream, predicateIndex); + readHash = readStream.calculateHashSum(); } - tmp.delete(); + +if (readHash.compareTo(writeHash) != 0) + throw new IOException("Event hash for writing and reading events did not match."); + display("done.\n"); if (sort) { @@ -91,12 +102,19 @@ public void index(ObjectStream eventStream) throws IOException { * occur at least cutoff times are added to the * predicatesInOut map along with a unique integer index. * + * Protocol: + * 1 - (utf string) - Event outcome + * 2 - (int) - Event context array length + * 3+ - (utf string) - Event context string + * 4 - (int) - Event values array length + * 5+ - (float) - Event value + * * @param eventStream an EventStream value * @param eventStore a writer to which the events are written to for later processing. 
* @param predicatesInOut a TObjectIntHashMap value * @param cutoff an int value */ - private int computeEventCounts(ObjectStream eventStream, Writer eventStore, + private int computeEventCounts(ObjectStream eventStream, DataOutputStream eventStore, MappredicatesInOut, int cutoff) throws IOException { Map counter = new HashMap<>(); int eventCount = 0; @@ -104,9 +122,23 @@ private int computeEventCounts(ObjectStream eventStream, Writer eventStor Event ev; while ((ev = eventStream.read()) != null) { eventCount++; - eventStore.write(FileEventStream.toLine(ev)); + + eventStore.writeUTF(ev.getOutcome()); + + eventStore.writeInt(ev.getContext().length); String[] ec = ev.getContext(); update(ec, counter); + for (String ctxString : ec) +eventStore.writeUTF(ctxString); + + if (ev.getValues() == null) { +eventStore.writeInt(0); + } + else { +eventStore.writeInt(ev.getValues().length); +for (float value : ev.getValues()) + eventStore.writeFloat(value); + } } String[] predicateSet = counter.entrySet().stream() @@ -122,4 +154,45 @@ private int computeEventCounts(ObjectStream eventStream, Writer eventStor return eventCount; } + + private static class EventStream