Author: rwesten
Date: Thu Sep 22 12:32:54 2011
New Revision: 1174089
URL: http://svn.apache.org/viewvc?rev=1174089&view=rev
Log:
corrected some formatting issues
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext?rev=1174089&r1=1174088&r2=1174089&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.mdtext
Thu Sep 22 12:32:54 2011
@@ -1,25 +1,25 @@
-= KeywordExtractionEngine =
+# KeywordExtractionEngine #
The
[KeywordExtractionEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/)
is a re-implementation of the
[TaxonomyLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/)
that is more modular and therefore better suited for future improvements and
extensions as requested by
[STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303).
-== Multiple Language Support ==
+## Multiple Language Support ##
The KeywordExtractionEngine can support multiple languages. However, the
performance and, to some extent, also the quality of the enhancements depend
on the following parameters:
* **Language detection:** The KeywordExtractionEngine depends on the correct
detection of the language by the LangIdEnhancementEngine. If no language was
detected or this information is missing, "English" is assumed as the default.
* **Multi-Lingual labels of the Controlled Vocabulary:** Occurrences are
searched within labels of the current language and labels without any defined
language, e.g. English labels will not be matched against German language texts.
* **Natural Language Processing support:** The KeywordExtractionEngine is able
to use [Sentence
Detectors](http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html),
[POS (Part of Speech)
taggers](http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html)
and
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html).
If such components are available for a language, they are used to optimize
the enhancement process.
- * Sentence detector: If a sentence detector is present the memory footprint
of the engines improves, because Tokens, POS tags and Chunks are only kept for
the currently active sentence. If no sentence detector is available the whole
content is treated as a single Sentence.
- * Tokenizer: A (word)
[tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html)
is required. If no tokenizer is available for a given language, than the
[OpenNLP
SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html)
is used as default.
- * POS tagger: POS taggers annotate tokens with there type. Because of the
KeywordExtractionEngine is only interested in Nouns, Foreign Words and Numbers
the presence of such an tagger allows to skip a lot of the tokens and to
improve performance. However POS taggers use different sets of tags for
different languages. Because of that it is not enough that a POS tagger is
available for a language there MUST BE also a configuration of the POS tags for
that language that need to be processed.
- * Chunker: There are two types of Chunkers. First the
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html)
as provided by OpenNLP (based on statistical models) and second a [POS tag
based
Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java)
provided by the openNLP bundle of Stanbol. Currently the Availability of a
Chunker does not have an big influence on the performance nor the quality of
the Enhancements.
+ **Sentence detector:** If a sentence detector is present, the memory
footprint of the engine improves, because tokens, POS tags and chunks are only
kept for the currently active sentence. If no sentence detector is available,
the whole content is treated as a single sentence.
+ **Tokenizer:** A (word)
[tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html)
is required. If no tokenizer is available for a given language, the
[OpenNLP
SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html)
is used as the default.
+ **POS tagger:** POS taggers annotate tokens with their type. Because the
KeywordExtractionEngine is only interested in nouns, foreign words and numbers,
the presence of such a tagger allows many tokens to be skipped and improves
performance. However, POS taggers use different sets of tags for different
languages. Because of that, it is not enough that a POS tagger is available for
a language; there MUST also be a configuration of the POS tags that need to be
processed for that language.
+ **Chunker:** There are two types of chunkers: first, the
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html)
provided by OpenNLP (based on statistical models), and second, a [POS tag
based
Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java)
provided by the openNLP bundle of Stanbol. Currently the availability of a
chunker has no big influence on either the performance or the quality of the
enhancements.
* **Configuration:** The set of languages to be annotated can be configured
for the KeywordExtractionEngine. An empty configuration indicates that texts in
any language should be processed. By using this configuration it is possible to
set up different KeywordExtractionEngine instances for different languages
(e.g. with different configurations), as shown in the sketch after this list.
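Roughly, the language handling described above could be sketched as follows. This is a minimal illustration in Java; the class and method names are assumptions that do not exist in the engine, and only the behaviour (falling back to English, treating an empty configuration as "all languages") is taken from the text.

    // Hypothetical sketch of the language gate described above; the class
    // and method names are illustrative, not the actual engine code.
    import java.util.Collections;
    import java.util.Set;

    public class LanguageGate {

        /** Configured languages; an empty set means "process all languages". */
        private final Set<String> configuredLanguages;

        public LanguageGate(Set<String> configuredLanguages) {
            this.configuredLanguages = configuredLanguages == null
                    ? Collections.<String>emptySet() : configuredLanguages;
        }

        /**
         * @param detectedLanguage the language reported by the
         *        LangIdEnhancementEngine, or null if none was detected
         * @return the language to use for processing, or null if the text
         *         should be skipped by this engine instance
         */
        public String resolveLanguage(String detectedLanguage) {
            // fall back to English when no language information is available
            String language = detectedLanguage != null ? detectedLanguage : "en";
            if (configuredLanguages.isEmpty() || configuredLanguages.contains(language)) {
                return language;
            }
            return null; // not configured for this engine instance
        }
    }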
-== Keyword extraction workflow ==
+## Keyword extraction workflow ##
Basically, the text is parsed from the beginning to the end and words are
looked up in the configured Controlled Vocabulary.
-=== Text Processing ===
+### Text Processing ###
The
[AnalysedContent](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java)
interface is used to access natural language text that was already processed
by an NLP framework. Currently there is only a single implementation, based on
the commons.opennlp
[TextAnalyzer](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java)
utility. In general this part is still very focused on OpenNLP; making it
usable with other NLP frameworks would probably require a lot of
refactoring.
@@ -32,7 +32,7 @@ The current state of the processing is r
The ProcessingState provides the means to navigate to the next token. If chunks
are present, tokens that are outside of chunks are ignored, as illustrated in
the sketch below.
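As an illustration of this navigation, here is a hedged sketch with assumed interface names (not the actual AnalysedContent or ProcessingState API): tokens outside of chunks are skipped and, when POS information is available, only tokens whose tag is in a configured set are treated as processable.

    // Simplified, hypothetical model of the analysed text and of the
    // navigation performed by the ProcessingState; the real interfaces in
    // commons.opennlp and the keywordextraction engine differ.
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    public class TokenIterationSketch {

        interface Sentence {
            Iterator<Token> getTokens();
        }

        interface Token {
            String getText();
            /** POS tag, or null if no POS tagger is available for the language. */
            String getPosTag();
            /** false if chunks are present and this token lies outside of them. */
            boolean isWithinChunk();
        }

        /** Example POS tag configuration (English Penn Treebank tags for nouns,
         *  foreign words and numbers); other languages need other tag sets. */
        private final Set<String> processedPosTags = new HashSet<String>(
                Arrays.asList("NN", "NNS", "NNP", "NNPS", "FW", "CD"));

        void process(Iterator<Sentence> sentences) {
            while (sentences.hasNext()) {
                Iterator<Token> tokens = sentences.next().getTokens();
                while (tokens.hasNext()) {
                    Token token = tokens.next();
                    if (!token.isWithinChunk()) {
                        continue; // tokens outside of chunks are ignored
                    }
                    String pos = token.getPosTag();
                    if (pos != null && !processedPosTags.contains(pos)) {
                        continue; // with POS info, only configured tags are processed
                    }
                    // ... the remaining "processable" tokens feed the entity
                    //     lookup described in the next section
                }
            }
        }
    }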
-=== Entity Lookup ===
+### Entity Lookup ###
A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities via
the
[EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java)
interface. If the actual implementation cut off results, than it must be
ensured that Entities that match both tokens are ranked first.
Currently there are two implementations of this interface (1) for the
Entityhub
([EntityhubSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java))
and (2) for ReferencedSitess
([ReferencedSiteSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java)).
There is also an
[Implementation](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java)
that holds entities in-memory, however currently this is only used for unit
tests.
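A hedged sketch of this lookup step is shown below. The EntitySearcher signature and the use of rdfs:label as the name field are assumptions made for illustration and do not necessarily match the interface linked above.

    // Sketch of the "OR" lookup over up to MAX_SEARCH_TOKENS tokens; the
    // EntitySearcher stand-in below is an assumption, not the real interface.
    import java.util.Collection;
    import java.util.List;

    public class EntityLookupSketch {

        /** Illustrative value; the real limit is configurable. */
        private static final int MAX_SEARCH_TOKENS = 2;

        /** Minimal stand-in for the EntitySearcher interface linked above. */
        interface EntitySearcher {
            /** Returns entities whose name field matches ANY of the tokens
             *  in the given language (or with no language tag). */
            Collection<String> lookup(String nameField, List<String> tokens,
                    String language);
        }

        Collection<String> lookup(EntitySearcher searcher,
                List<String> processableTokens, String language) {
            // use at most MAX_SEARCH_TOKENS of the upcoming processable tokens;
            // the search never crosses the current chunk/sentence boundary
            List<String> searchTokens = processableTokens.subList(
                    0, Math.min(MAX_SEARCH_TOKENS, processableTokens.size()));
            // "OR" semantics: a match on a single token is sufficient, but
            // implementations that cut off results should rank entities
            // matching more of the tokens first
            return searcher.lookup("rdfs:label", searchTokens, language);
        }
    }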
@@ -46,7 +46,7 @@ Only "processable" Tokens are used to lo
Typically the next MAX_SEARCH_TOKENS processable tokens are used for a lookup.
However, the current chunk/sentence is never left in the search for processable
tokens.
-=== Matching of found Entities: ===
+### Matching of found Entities: ###
All labels (values of the
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getNameField()
field) in the language of the content or without any defined language are
candidates for matches.
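The selection of candidate labels could look roughly like the following sketch; the Literal class here is a stand-in for whatever label representation the engine actually uses.

    // Minimal sketch of the candidate-label selection: only labels in the
    // content language or without a language tag are considered.
    import java.util.ArrayList;
    import java.util.List;

    public class LabelFilterSketch {

        static class Literal {
            final String text;
            final String language; // null if no language is defined
            Literal(String text, String language) {
                this.text = text;
                this.language = language;
            }
        }

        static List<Literal> candidateLabels(List<Literal> labels, String contentLanguage) {
            List<Literal> candidates = new ArrayList<Literal>();
            for (Literal label : labels) {
                // e.g. English labels are not matched against German texts
                if (label.language == null || label.language.equals(contentLanguage)) {
                    candidates.add(label);
                }
            }
            return candidates;
        }
    }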
@@ -64,7 +64,7 @@ Entities are [Suggested](http://svn.apac
The described matching process is currently a direct part of the
[EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java).
To support different matching strategies, this would need to be externalized
into a separate "EntityLabelMatcher" interface, such as the one sketched below.
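As a sketch of what such an extension point might look like (this interface does not exist in the code base; the signature is purely illustrative):

    // One possible shape for the suggested "EntityLabelMatcher" extension
    // point; purely illustrative.
    import java.util.List;

    public interface EntityLabelMatcher {

        /**
         * Matches the tokens at the current position of the text against the
         * candidate labels of a found entity.
         *
         * @param searchTokens the processable tokens used for the lookup
         * @param labels the candidate labels of the entity (content language
         *        or no language)
         * @return a match score in the range [0..1], where 0 means no match
         */
        double match(List<String> searchTokens, List<String> labels);
    }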
-=== Processing of Entity Suggestions ===
+### Processing of Entity Suggestions ###
If there are one or more
[Suggestion](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java)s
of entities for the current position within the text, a
[LinkedEntity](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/LinkedEntity.java)
instance is created.
@@ -77,7 +77,7 @@ In some cases suggested entities might r
To support such use cases the KeywordExtractionEngine has support for
redirects. Users can configure, first, the redirect mode (ignore, copy values,
follow) and, second, the field used to search for redirects
(default=rdfs:seeAlso).
If the redirect mode is not "ignore", then for each suggestion the entities
referenced by the configured redirect field are retrieved. In the copy values
mode the values of the name and type fields are copied. In the follow mode
the suggested entity is replaced with the first redirected entity.
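A minimal sketch of this redirect handling follows; the type and method names are assumptions, and only the three modes and the default rdfs:seeAlso field are taken from the description above.

    // Hedged sketch of the redirect handling; the enum and helper methods
    // are illustrative, not the actual engine code.
    public class RedirectSketch {

        enum RedirectMode { IGNORE, COPY_VALUES, FOLLOW }

        /** Field used to look up redirects (default as described above). */
        private String redirectField = "rdfs:seeAlso";
        private RedirectMode mode = RedirectMode.IGNORE;

        void processRedirects(Suggestion suggestion) {
            if (mode == RedirectMode.IGNORE) {
                return; // redirects are not resolved at all
            }
            Entity redirected = suggestion.getEntity().resolve(redirectField);
            if (redirected == null) {
                return; // no redirect target present
            }
            if (mode == RedirectMode.COPY_VALUES) {
                // copy the name and type values of the redirect target
                suggestion.getEntity().copyNameAndType(redirected);
            } else { // FOLLOW
                // replace the suggested entity with the first redirect target
                suggestion.setEntity(redirected);
            }
        }

        // Placeholder types so the sketch is self-contained
        interface Entity {
            Entity resolve(String redirectField);
            void copyNameAndType(Entity source);
        }
        interface Suggestion {
            Entity getEntity();
            void setEntity(Entity entity);
        }
    }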
-=== Confidence for Suggestions ===
+### Confidence for Suggestions ###
The confidence for suggestions is calculated based on the following algorithm:
@@ -98,7 +98,7 @@ Some Examples:
The calculation of the confidence is currently a direct part of the
[EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java).
To support different confidence calculation strategies, this would need to be
externalized into a separate interface.
-== Future Plans for the TaxonomyLinkingEngine ==
+## Future Plans for the TaxonomyLinkingEngine ##
The TaxonomyLinkingEngine is still available and fully functional. However it
is marked as deprecated and not included in any of the launchers. Current users
are encouraged to switch over to the KeywordExtractionEngine.