Author: buildbot
Date: Thu Sep 22 12:33:03 2011
New Revision: 796102
Log:
Staging update by buildbot
Modified:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.html
Modified:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.html
==============================================================================
---
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.html
(original)
+++
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/KeywordExtractionEngine.html
Thu Sep 22 12:33:03 2011
@@ -46,23 +46,23 @@
<div id="content">
<h1 class="title"></h1>
- <p>= KeywordExtractionEngine = </p>
+ <h1 id="keywordextractionengine">KeywordExtractionEngine</h1>
<p>The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/">KeywordExtractionEngine</a>
is a re-implementation of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/">TaxonomyLinkingEngine</a>
that is more modular and therefore better suited for future improvements and
extensions as requested by <a
href="https://issues.apache.org/jira/browse/STANBOL-303">STANBOL-303</a>.</p>
-<p>== Multiple Language Support ==</p>
+<h2 id="multiple_language_support">Multiple Language Support</h2>
<p>The KeywordExtractionEngine supports multiple languages. However, the
performance and, to some extent, also the quality of the enhancements depend
on the following parameters:</p>
<ul>
<li><strong>Language detection:</strong> The KeywordExtractionEngine depends
on the correct detection of the language by the LangIdEnhancementEngine. If no
language is detected or this information is missing, then "English" is assumed
as the default.</li>
<li><strong>Multi-lingual labels of the Controlled Vocabulary:</strong>
Occurrences are searched within labels of the current language and labels
without any defined language; e.g. English labels will not be matched against
German-language texts.</li>
-<li><strong>Natural Language Processing support:</strong> The
KexwordExtractionEngine is able to use <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html">Sentence
Detectors</a>, <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html">POS
(Part of Speech) taggers</a> and <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>.
If such components are available for a language the they are used to optimize
the enhancement process.</li>
-<li>Sentence detector: If a sentence detector is present the memory footprint
of the engines improves, because Tokens, POS tags and Chunks are only kept for
the currently active sentence. If no sentence detector is available the whole
content is treated as a single Sentence.</li>
-<li>Tokenizer: A (word) <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html">tokenizer</a>
is required. If no tokenizer is available for a given language, than the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html">OpenNLP
SimpleTokenizer</a> is used as default.</li>
-<li>POS tagger: POS taggers annotate tokens with there type. Because of the
KeywordExtractionEngine is only interested in Nouns, Foreign Words and Numbers
the presence of such an tagger allows to skip a lot of the tokens and to
improve performance. However POS taggers use different sets of tags for
different languages. Because of that it is not enough that a POS tagger is
available for a language there MUST BE also a configuration of the POS tags for
that language that need to be processed.</li>
-<li>Chunker: There are two types of Chunkers. First the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>
as provided by OpenNLP (based on statistical models) and second a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java">POS
tag based Chunker</a> provided by the openNLP bundle of Stanbol. Currently the
Availability of a Chunker does not have an big influence on the performance nor
the quality of the Enhancements.</li>
+<li><strong>Natural Language Processing support:</strong> The
KeywordExtractionEngine is able to use <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html">Sentence
Detectors</a>, <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html">POS
(Part of Speech) taggers</a> and <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>.
If such components are available for a language, they are used to optimize
the enhancement process (see the sketch after this list).
+ <strong>Sentence detector:</strong> If a sentence detector is present, the
memory footprint of the engine improves, because tokens, POS tags and chunks
are only kept for the currently active sentence. If no sentence detector is
available, the whole content is treated as a single sentence.
+ <strong>Tokenizer:</strong> A (word) <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html">tokenizer</a>
is required. If no tokenizer is available for a given language, then the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html">OpenNLP
SimpleTokenizer</a> is used as the default.
+ <strong>POS tagger:</strong> POS taggers annotate tokens with their type.
Because the KeywordExtractionEngine is only interested in nouns, foreign
words and numbers, the presence of such a tagger allows many tokens to be
skipped, which improves performance. However, POS taggers use different sets
of tags for different languages. It is therefore not enough that a POS tagger
is available for a language; there MUST also be a configuration of the POS
tags that need to be processed for that language.
+ <strong>Chunker:</strong> There are two types of chunkers: first, the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>
provided by OpenNLP (based on statistical models) and second, a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java">POS
tag based Chunker</a> provided by the openNLP bundle of Stanbol. Currently the
availability of a chunker does not have a big influence on either the
performance or the quality of the enhancements.</li>
<li><strong>Configuration:</strong> The set of languages to be annotated can
be configured for the KeywordExtractionEngine. An empty configuration indicates
that texts in any language should be processed. By using this configuration it
is possible to set up different KeywordExtractionEngine instances for
different languages (e.g. with different configurations).</li>
</ul>
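<p>The following is a minimal sketch of how such OpenNLP components are
typically used (plain OpenNLP API, not the Stanbol integration code). The model
file names "en-sent.bin" and "en-pos-maxent.bin" are the standard English
models and are assumed to be available in the working directory:</p>
<pre>
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.SimpleTokenizer;

public class NlpComponentsSketch {

    public static void main(String[] args) throws Exception {
        // Sentence detector: keeps the processed unit small (one sentence at a time)
        InputStream sentIn = new FileInputStream("en-sent.bin");
        SentenceDetectorME sentenceDetector =
                new SentenceDetectorME(new SentenceModel(sentIn));
        sentIn.close();

        // POS tagger: used to skip tokens that are not nouns, foreign words or numbers
        InputStream posIn = new FileInputStream("en-pos-maxent.bin");
        POSTaggerME posTagger = new POSTaggerME(new POSModel(posIn));
        posIn.close();

        String text = "Barack Obama visited the International Monetary Fund. "
                + "The IMF is based in Washington.";

        for (String sentence : sentenceDetector.sentDetect(text)) {
            // SimpleTokenizer is the fallback when no language specific tokenizer exists
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(sentence);
            String[] tags = posTagger.tag(tokens);
            for (int i = 0; i &lt; tokens.length; i++) {
                // only noun-like tokens (Penn Treebank tags starting with "NN")
                // would be used for entity lookups
                if (tags[i].startsWith("NN")) {
                    System.out.println(tokens[i] + " (" + tags[i] + ")");
                }
            }
        }
    }
}
</pre>
<p>The engine itself obtains such components via the commons.opennlp
TextAnalyzer utility described below; the sketch only illustrates the order in
which they are applied.</p>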
-<p>== Keyword extraction workflow ==</p>
+<h2 id="keyword_extraction_workflow">Keyword extraction workflow</h2>
<p>Basically, the text is parsed from the beginning to the end and words are
looked up in the configured Controlled Vocabulary.</p>
-<p>=== Text Processing ===</p>
+<h3 id="text_processing">Text Processing</h3>
<p>The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java">AnalysedContent</a>
interface is used to access natural language text that was already processed
by an NLP framework. Currently there is only a single implementation, based on
the commons.opennlp <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java">TextAnalyzer</a>
utility. In general this part is still very focused on OpenNLP; making it
usable together with other NLP frameworks would probably require a lot of
refactoring.</p>
<p>The current state of the processing is represented by the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/ProcessingState.java">ProcessingState</a>.
Based on the capabilities of the NLP framework for the current language it
provides a different set of information:</p>
<ul>
@@ -72,7 +72,7 @@
<li><strong>TokenIndex:</strong> The index of the currently active token
relative to the AnalysedSentence.</li>
</ul>
<p>The ProcessingState provides the means to navigate to the next token. If
chunks are present, tokens that are outside of chunks are ignored.</p>
-<p>=== Entity Lookup ===</p>
+<h3 id="entity_lookup">Entity Lookup</h3>
<p>A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities
via the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
interface. If the actual implementation cut off results, than it must be
ensured that Entities that match both tokens are ranked first.
Currently there are two implementations of this interface (1) for the
Entityhub (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java">EntityhubSearcher</a>)
and (2) for ReferencedSitess (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java">ReferencedSiteSearcher</a>).
There is also an <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java">Implementation</a>
that holds entities in-memory, however currently this is only used for unit
tests.</p>
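<p>The exact contract is defined by the EntitySearcher interface linked above.
Purely as an illustration of the described "OR" semantics - and of why entities
matching more of the tokens must be ranked first when results are cut off - a
simplified in-memory searcher could look like the following sketch; all names
are hypothetical and do not mirror the real interface:</p>
<pre>
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, simplified stand-in for an in-memory entity searcher. */
public class InMemoryEntitySearcherSketch {

    /** A candidate entity: an identifier plus a single label. */
    static class Entity {
        final String id;
        final String label;
        Entity(String id, String label) { this.id = id; this.label = label; }
    }

    private final List&lt;Entity&gt; entities = new ArrayList&lt;Entity&gt;();

    public void add(String id, String label) {
        entities.add(new Entity(id, label));
    }

    /**
     * "OR" lookup: returns entities whose label contains ANY of the tokens,
     * ranked so that entities matching more tokens come first. This ordering
     * matters whenever an implementation cuts off the result list.
     */
    public List&lt;Entity&gt; lookup(final String[] tokens, int maxResults) {
        List&lt;Entity&gt; hits = new ArrayList&lt;Entity&gt;();
        for (Entity e : entities) {
            if (matchCount(e, tokens) &gt; 0) {
                hits.add(e);
            }
        }
        Collections.sort(hits, new Comparator&lt;Entity&gt;() {
            public int compare(Entity a, Entity b) {
                return matchCount(b, tokens) - matchCount(a, tokens);
            }
        });
        return hits.subList(0, Math.min(maxResults, hits.size()));
    }

    private int matchCount(Entity e, String[] tokens) {
        int count = 0;
        for (String token : tokens) {
            if (e.label.toLowerCase().contains(token.toLowerCase())) {
                count++;
            }
        }
        return count;
    }
}
</pre>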
<p>Queries use the configured <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField(),
and the language of labels is restricted to the current language or to labels
that do not define any language.</p>
@@ -82,7 +82,7 @@ Currently there are two implementations
<li>If this method returns NULL or no POS tags are available, then all tokens
with at least as many characters as <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinSearchTokenLength()
(default=3) are considered processable.</li>
</ul>
<p>Typically the next MAX_SEARCH_TOKENS processable tokens are used for a
lookup. However, the search for processable tokens never leaves the current
chunk/sentence.</p>
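<p>A minimal, purely illustrative sketch of this token selection (hypothetical
names, and an assumed MAX_SEARCH_TOKENS default of 2; the real logic is part of
the EntityLinker and ProcessingState classes linked in this section):</p>
<pre>
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: collects up to MAX_SEARCH_TOKENS processable tokens,
 * starting at the active token and never leaving the current chunk
 * (or sentence, if no chunker is available).
 */
public class LookupWindowSketch {

    static final int MAX_SEARCH_TOKENS = 2;       // assumed default
    static final int MIN_SEARCH_TOKEN_LENGTH = 3; // default=3 (see above)

    /** Without POS tags a token is processable if it is long enough. */
    static boolean isProcessable(String token, String posTag) {
        if (posTag != null) {
            // with POS tags only nouns, foreign words and numbers are of interest
            return posTag.startsWith("NN") || posTag.equals("FW") || posTag.equals("CD");
        }
        return token.length() &gt;= MIN_SEARCH_TOKEN_LENGTH;
    }

    static List&lt;String&gt; lookupTokens(String[] chunkTokens, String[] posTags, int activeToken) {
        List&lt;String&gt; window = new ArrayList&lt;String&gt;();
        for (int i = activeToken; i &lt; chunkTokens.length; i++) { // never leaves the chunk
            String tag = posTags == null ? null : posTags[i];
            if (isProcessable(chunkTokens[i], tag)) {
                window.add(chunkTokens[i]);
                if (window.size() == MAX_SEARCH_TOKENS) {
                    break;
                }
            }
        }
        return window;
    }

    public static void main(String[] args) {
        String[] chunk = {"the", "International", "Monetary", "Fund"};
        // without POS tags even "the" (3 chars) is processable
        System.out.println(lookupTokens(chunk, null, 0)); // [the, International]
        String[] tags = {"DT", "NNP", "NNP", "NNP"};
        System.out.println(lookupTokens(chunk, tags, 0)); // [International, Monetary]
    }
}
</pre>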
-<p>=== Matching of found Entities: ===</p>
+<h3 id="matching_of_found_entities">Matching of found Entities:</h3>
<p>All labels (values of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
field) in the language of the content or without any defined language are
candidates for matches.</p>
<p>For each label that fulfills the above criteria the following steps are
performed. The best result is used as the result of the whole matching
process:</p>
<ul>
@@ -97,7 +97,7 @@ Currently there are two implementations
<li>a label matches only if at least <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
(default=2) of its tokens match the text. This ensures that "<a
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>", but
that "Barack Hussein Obama" is suggested for "Barack Obama". Setting
"minFoundToken" to values less than two will usually cause a lot of false
positives, but would also produce a suggestion for "Barack Obama" if the
content only contains the word "Obama" (see the sketch after this list).</li>
</ul>
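<p>As a rough illustration of the minFoundTokens criterion (a hypothetical
helper, not the actual EntityLinker code):</p>
<pre>
/** Illustrative check for the minFoundTokens criterion described above. */
public class MinFoundTokensSketch {

    /** Counts how many tokens of the label occur (case insensitive) in the text. */
    static int foundTokens(String[] labelTokens, String[] textTokens) {
        int found = 0;
        for (String labelToken : labelTokens) {
            for (String textToken : textTokens) {
                if (labelToken.equalsIgnoreCase(textToken)) {
                    found++;
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int minFoundTokens = 2; // default=2
        String[] text = {"Rupert", "met", "Barack", "Obama", "in", "Berlin"};

        String[] obama = {"Barack", "Hussein", "Obama"};
        String[] murdoch = {"Rupert", "Murdoch"};

        // "Barack" and "Obama" match: 2 &gt;= minFoundTokens, so the label is suggested
        System.out.println(foundTokens(obama, text) &gt;= minFoundTokens);   // true
        // only "Rupert" matches: 1 &lt; minFoundTokens, so the label is not suggested
        System.out.println(foundTokens(murdoch, text) &gt;= minFoundTokens); // false
    }
}
</pre>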
<p>The described matching process is currently a direct part of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
To support different matching strategies it would need to be externalized
into its own "EntityLabelMatcher" interface.</p>
-<p>=== Processing of Entity Suggestions ===</p>
+<h3 id="processing_of_entity_suggestions">Processing of Entity Suggestions</h3>
<p>If there are one or more <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggestion</a>s
of entities for the current position within the text, a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/LinkedEntity.java">LinkedEntity</a>
instance is created.</p>
<p>LinkedEntity is an object model representing the Stanbol Enhancement
Structure. After the processing of the parsed content is completed, the
LinkedEntities are "serialized" as RDF triples to the metadata of the
ContentItem.</p>
<p>TextAnnotations as defined in the <a
href="http://wiki.iks-project.eu/index.php/EnhancementStructure">Stanbol
Enhancement Structure</a> use the <a
href="http://www.dublincore.org/documents/dcmi-terms/#terms-type">dc:type</a>
property to provide the general type of the extracted entity. However,
suggested entities might have very specific types. Therefore the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>
provides the possibility to map the specific types of the entity to types used
for the dc:type property of TextAnnotations. The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.DEFAULT_ENTITY_TYPE_MAPPINGS
contains some predefined mappings.
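<p>As an illustration of such type mappings - the concrete URIs below are
examples only; the authoritative list is the DEFAULT_ENTITY_TYPE_MAPPINGS
constant linked above:</p>
<pre>
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of mapping specific entity types to dc:type values. */
public class TypeMappingSketch {

    // specific entity type -&gt; type used for dc:type of the TextAnnotation
    // (example mappings only; see EntityLinkerConfig.DEFAULT_ENTITY_TYPE_MAPPINGS)
    static final Map&lt;String, String&gt; TYPE_MAPPINGS = new HashMap&lt;String, String&gt;();
    static {
        TYPE_MAPPINGS.put("http://dbpedia.org/ontology/Politician",
                          "http://dbpedia.org/ontology/Person");
        TYPE_MAPPINGS.put("http://xmlns.com/foaf/0.1/Person",
                          "http://dbpedia.org/ontology/Person");
        TYPE_MAPPINGS.put("http://dbpedia.org/ontology/City",
                          "http://dbpedia.org/ontology/Place");
    }

    /** Returns the dc:type to use, or null if no mapping is defined. */
    static String mapType(String specificType) {
        return TYPE_MAPPINGS.get(specificType);
    }

    public static void main(String[] args) {
        System.out.println(mapType("http://dbpedia.org/ontology/Politician"));
        // -&gt; http://dbpedia.org/ontology/Person
    }
}
</pre>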
@@ -105,7 +105,7 @@ Note that field used to retrieve the typ
<p>In some cases suggested entities might redirect to others. In the case of
Wikipedia/DBpedia this is often used to link from acronyms like <a
href="http://en.wikipedia.org/w/index.php?title=IMF&redirect=no">IMF</a> to
the real entity <a
href="http://en.wikipedia.org/wiki/International_Monetary_Fund">International
Monetary Fund</a>. Also, some thesauri define labels as entities of their own
with a URI, and users might want to use the URI of the concept rather than
that of the label.
To support such use cases the KeywordExtractionEngine has support for
redirects. Users can first configure the redirect mode (ignore, copy values,
follow) and secondly the field used to search for redirects
(default=rdfs:seeAlso).
If the redirect mode is not "ignore", for each suggestion the entities
referenced by the configured redirect field are retrieved. In the "copy values"
mode the values of the name and type fields are copied. In the "follow" mode
the suggested entity is replaced with the first redirected entity (see the
sketch below).</p>
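<p>A minimal sketch of how the three redirect modes could be handled
(illustrative names only; the actual processing is part of the engine):</p>
<pre>
/** Illustrative sketch of the three redirect modes. */
public class RedirectModeSketch {

    enum RedirectMode { IGNORE, COPY_VALUES, FOLLOW }

    /** Minimal stand-in for a suggested entity (hypothetical). */
    static class Suggestion {
        String entityUri;
        String name;
        String type;
    }

    static void applyRedirect(Suggestion suggestion, Suggestion redirectTarget, RedirectMode mode) {
        if (mode == RedirectMode.IGNORE || redirectTarget == null) {
            return; // redirects (e.g. rdfs:seeAlso values) are not considered
        }
        switch (mode) {
            case COPY_VALUES:
                // keep the suggested entity but copy name and type from the target
                suggestion.name = redirectTarget.name;
                suggestion.type = redirectTarget.type;
                break;
            case FOLLOW:
                // replace the suggestion with the first redirected entity
                suggestion.entityUri = redirectTarget.entityUri;
                suggestion.name = redirectTarget.name;
                suggestion.type = redirectTarget.type;
                break;
            default:
                break;
        }
    }
}
</pre>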
-<p>=== Confidence for Suggestions ===</p>
+<h3 id="confidence_for_suggestions">Confidence for Suggestions</h3>
<p>The confidence for suggestions is calculated based on the following
algorithm:</p>
<p>Input Parameters</p>
<ul>
@@ -122,7 +122,7 @@ If the redirect mode != ignore for each
<li>"New York City" matched against the text "New York Rangers" - assuming
that "New York Rangers" is the best match - results in a confidence of (2/3)^2
* (2/2) * (2/3) = 0,3; Note that the best match "New York Rangers" has
max_matched=3 and gets a confidence of 1.</li>
</ul>
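<p>Purely as an illustration of the arithmetic in the example above (the factor
names below are hypothetical and only reproduce the numbers of the example):</p>
<pre>
/** Reproduces the arithmetic of the "New York City" example (hypothetical factor names). */
public class ConfidenceSketch {

    static double confidence(int matched, int labelTokens, int foundTokens, int maxMatched) {
        // (matched/labelTokens)^2 * (matched/foundTokens) * (matched/maxMatched)
        double labelScore = (double) matched / labelTokens;
        return labelScore * labelScore
                * ((double) matched / foundTokens)
                * ((double) matched / maxMatched);
    }

    public static void main(String[] args) {
        // "New York City" (3 label tokens, 2 matched) against "New York Rangers",
        // where the best match has max_matched=3: (2/3)^2 * (2/2) * (2/3) ~ 0.3
        System.out.println(confidence(2, 3, 2, 3)); // 0.296...
        // the best match "New York Rangers" itself: (3/3)^2 * (3/3) * (3/3) = 1.0
        System.out.println(confidence(3, 3, 3, 3)); // 1.0
    }
}
</pre>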
<p>The calculation of the confidence is currently a direct part of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
To support different strategies for calculating the confidence, this would
need to be externalized into its own interface.</p>
-<p>== Future Plans for the TaxonomyLinkingEngine ==</p>
+<h2 id="future_plans_for_the_taxonomylinkingengine">Future Plans for the
TaxonomyLinkingEngine</h2>
<p>The TaxonomyLinkingEngine is still available and fully functional. However,
it is marked as deprecated and is not included in any of the launchers. Current
users are encouraged to switch over to the KeywordExtractionEngine.</p>
<p>In the future it is planned to repurpose the TaxonomyLinkingEngine as a
special version of the KeywordExtractionEngine with a specialized configuration
and feature set targeted at (hierarchical) taxonomies.</p>
<p>This will include: </p>