[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner reassigned OPENNLP-1357:
---------------------------------------

    Assignee: Martin Wiesner

> Use CharSequence to allow for memory management
> -----------------------------------------------
>
>                 Key: OPENNLP-1357
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1357
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: 1.9.4
>            Reporter: Paul Austin
>            Assignee: Martin Wiesner
>            Priority: Minor
>             Fix For: 2.1.1
>
>
> Most of the classes in OpenNLP require their inputs to be a String,
> StringBuffer, or char[]. This means that you have to load all the data into
> memory.
> Many of these cases (String and StringBuffer args) could be replaced with a
> single method that accepts CharSequence as a parameter.
> For example, in DefaultEndOfSentenceScanner:
>
> {code:java}
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets.
> In some cases it would also require less temporary memory for conversions
> to char buffers.
> Some code, such as the SDContextGenerator, already uses CharSequence. However,
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer.
> The sb isn't modified, SDContextGenerator.getContext takes a CharSequence
> as an arg, and String is a CharSequence.
>
> {code:java}
> public Span[] sentPosDetect(String s) {
>   sentProbs.clear();
>   StringBuffer sb = new StringBuffer(s);{code}
>
> I can create pull request(s) for the above if you think it is useful.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
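
To illustrate the memory argument in the issue description, here is a minimal sketch of a caller-supplied CharSequence that reads characters straight from a ByteBuffer instead of from a String. The class names Latin1CharSequence and EosScanSketch are hypothetical and not part of OpenNLP; the scan method only mirrors the getPositions(CharSequence) shape proposed above. The sketch assumes ISO-8859-1 input, where one byte is one char, so charAt is a constant-time lookup; a variable-width encoding such as UTF-8 would need an offset index. In practice the buffer could be a MappedByteBuffer obtained from FileChannel.map, so the text never has to be copied onto the heap as a whole.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical read-only view over ISO-8859-1 bytes (one byte per char). */
final class Latin1CharSequence implements CharSequence {
  private final ByteBuffer bytes;
  private final int start; // inclusive offset into the buffer
  private final int end;   // exclusive offset into the buffer

  Latin1CharSequence(ByteBuffer bytes) {
    this(bytes, 0, bytes.limit());
  }

  private Latin1CharSequence(ByteBuffer bytes, int start, int end) {
    this.bytes = bytes;
    this.start = start;
    this.end = end;
  }

  @Override
  public int length() {
    return end - start;
  }

  @Override
  public char charAt(int index) {
    if (index < 0 || index >= length()) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    // Absolute get: does not move the buffer position; mask to avoid sign extension.
    return (char) (bytes.get(start + index) & 0xFF);
  }

  @Override
  public CharSequence subSequence(int from, int to) {
    if (from < 0 || to > length() || from > to) {
      throw new IndexOutOfBoundsException(from + ".." + to);
    }
    // Shares the underlying buffer; no characters are copied.
    return new Latin1CharSequence(bytes, start + from, start + to);
  }

  @Override
  public String toString() {
    // Materializes only this window, not the whole source.
    byte[] copy = new byte[length()];
    for (int i = 0; i < copy.length; i++) {
      copy[i] = bytes.get(start + i);
    }
    return new String(copy, StandardCharsets.ISO_8859_1);
  }
}

/** Hypothetical stand-in for a scanner that accepts any CharSequence. */
final class EosScanSketch {
  static List<Integer> getPositions(CharSequence s, Set<Character> eosCharacters) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (eosCharacters.contains(s.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }
}
```

Because String also implements CharSequence, existing callers that pass a String would keep working against such a signature without any conversion.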