Author: rwesten
Date: Thu Sep 22 12:35:15 2011
New Revision: 1174091
URL: http://svn.apache.org/viewvc?rev=1174091&view=rev
Log:
corrected some formatting issues
changed file name to lower case
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordextractionengine.mdtext
- copied, changed from r1174089,
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
Copied:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordextractionengine.mdtext
(from r1174089,
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext)
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordextractionengine.mdtext?p2=incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordextractionengine.mdtext&p1=incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext&r1=1174089&r2=1174091&rev=1174091&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordextractionengine.mdtext
Thu Sep 22 12:35:15 2011
@@ -1,4 +1,4 @@
-# KeywordExtractionEngine #
+# KeywordExtractionEngine #
The
[KeywordExtractionEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/)
is a re-implementation of the
[TaxonomyLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/)
that is more modular and therefore better suited for future improvements and
extensions as requested by
[STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303).
@@ -9,10 +9,15 @@ The KeywordExtractionEngine can supports
* **Language detection:** The KeywordExtractionEngine depends on the correct
detection of the language by the LangIdEnhancementEngine. If no language was
detected or this information is missing, then "English" is assumed as default.
* **Multi-Lingual labels of the Controlled Vocabulary:** Occurrences are
searched within labels of the current language and labels without any defined
language. For example, English labels will not be matched against German language texts.
* **Natural Language Processing support:** The KeywordExtractionEngine is able
to use [Sentence
Detectors](http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html),
[POS (Part of Speech)
taggers](http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html)
and
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html).
If such components are available for a language, then they are used to optimize
the enhancement process.
+
**Sentence detector:** If a sentence detector is present, the memory
footprint of the engine is reduced, because tokens, POS tags and chunks are only
kept for the currently active sentence. If no sentence detector is available,
the whole content is treated as a single sentence.
+
**Tokenizer:** A (word)
[tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html)
is required. If no tokenizer is available for a given language, then the
[OpenNLP
SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html)
is used as default.
+
**POS tagger:** POS taggers annotate tokens with their type. Because the
KeywordExtractionEngine is only interested in nouns, foreign words and numbers,
the presence of such a tagger allows it to skip many of the tokens and to
improve performance. However, POS taggers use different sets of tags for
different languages. Because of that, it is not enough that a POS tagger is
available for a language; there MUST also be a configuration of the POS tags
that need to be processed for that language.
+
**Chunker:** There are two types of chunkers: first, the
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html)
as provided by OpenNLP (based on statistical models), and second, a [POS tag
based
Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java)
provided by the OpenNLP bundle of Stanbol. Currently the availability of a
chunker does not have a big influence on either the performance or the quality
of the enhancements.
+
* **Configuration:** The set of languages to be annotated can be configured
for the KeywordExtractionEngine. An empty configuration indicates that texts in
any language should be processed. By using this configuration it is possible to
configure different KeywordExtractionEngine instances for different languages
(e.g. with different configurations).
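
The fallback rules in the list above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the engine's actual code: the class `KeywordExtractionSketch`, its method names, the language code `"en"` for "English" and the Penn Treebank style tags used in the usage note are all hypothetical and chosen for the example. It does not call OpenNLP or Stanbol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the rules described above; not the Stanbol implementation.
final class KeywordExtractionSketch {

    // The text names "English" as the default; the code "en" is an assumption.
    static final String DEFAULT_LANGUAGE = "en";

    // Falls back to the default language if none was detected.
    static String effectiveLanguage(String detected) {
        return (detected == null || detected.isEmpty()) ? DEFAULT_LANGUAGE : detected;
    }

    // An empty language configuration means that every language is processed.
    static boolean isLanguageProcessed(Set<String> configured, String language) {
        return configured.isEmpty() || configured.contains(language);
    }

    // Labels match if they have no language (null) or the language of the text.
    static boolean isLabelCandidate(String labelLanguage, String textLanguage) {
        return labelLanguage == null || labelLanguage.equals(textLanguage);
    }

    // Keeps only tokens whose POS tag is configured for the language.
    // Without POS tags (tags == null) or without a tag configuration for
    // the language, no token can be skipped, so all tokens are kept.
    static List<String> selectTokens(String[] tokens, String[] tags, String language,
                                     Map<String, Set<String>> processedTags) {
        Set<String> wanted = processedTags.get(language);
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags == null || wanted == null || wanted.contains(tags[i])) {
                selected.add(tokens[i]);
            }
        }
        return selected;
    }
}
```

With a configuration such as `Map.of("en", Set.of("NN", "NNP", "FW", "CD"))`, calling `selectTokens` for the tokens `Paris is nice` tagged `NNP VBZ JJ` keeps only `Paris`, while the same call for a language without a tag configuration keeps all three tokens. This mirrors the point made above that a POS tagger alone is not enough to skip tokens.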
## Keyword extraction workflow ##