Author: buildbot
Date: Thu Sep 22 15:38:45 2011
New Revision: 796109
Log:
Staging update by buildbot
Added:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
==============================================================================
---
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
(added)
+++
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
Thu Sep 22 15:38:45 2011
@@ -0,0 +1,151 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+ <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+ <title>Apache Stanbol - The Keyword Linking Engine: custom vocabularies and
multiple languages</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+ <link rel="icon" type="image/png"
href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+</head>
+
+<body>
+ <div id="navigation">
+ <img alt="Apache Stanbol" width="220" height="101"
src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/>
+ <h1 id="stanbol_links">Stanbol links</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a></li>
+</ul>
+<h1 id="asf_links">ASF links</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a
Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+ </div>
+
+ <div id="content">
+ <h1 class="title">The Keyword Linking Engine: custom vocabularies and
multiple languages</h1>
+ <p>The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/">KeywordLinkingEngine</a>
is a re-implementation of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/">TaxonomyLinkingEngine</a>
that is more modular and therefore better suited for future improvements and
extensions as requested by <a
href="https://issues.apache.org/jira/browse/STANBOL-303">STANBOL-303</a>. </p>
+<p>Currently the main advantage of using this engine is its ability to support
multiple languages and provide enhancement results specific to custom
vocabulary. </p>
+<h2 id="multiple_language_support">Multiple Language Support</h2>
+<p>The KeywordLinkingEngine supports the extraction of keywords in multiple
languages. However, the performance and to some extent also the quality of the
enhancements depend on how well a language is supported by the NLP
framework in use (currently OpenNLP).
+The following list provides a short overview of the language-specific
components and configurations:</p>
+<ul>
+<li><strong>Language detection:</strong> The KeywordLinkingEngine depends on
the correct detection of the language by the LanguageIdentificationEngine. If
no language is detected or this information is missing then "English" is
assumed as default.</li>
+<li><strong>Multi-lingual labels of the controlled vocabulary:</strong>
Entities are matched based on labels in the current language and labels without
any defined language. For example, English labels will not be matched against
German-language texts. It is therefore important to have a controlled vocabulary
that includes labels in the language of the texts you want to enhance.</li>
+<li><strong>Natural Language Processing support:</strong> The
KeywordLinkingEngine is able to use <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html">Sentence
Detectors</a>, <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html">POS
(Part of Speech) taggers</a> and <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>.
If such components are available for a language then they are used to optimize
the enhancement process.</li>
+</ul>
+<p><strong>Sentence detector:</strong> If a sentence detector is present, the
memory footprint of the engine is reduced, because Tokens, POS tags and Chunks
are only kept for the currently active sentence. If no sentence detector is
available, the entire content is treated as a single sentence.</p>
+<p><strong>Tokenizer:</strong> A (word) <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html">tokenizer</a>
is required for the enhancement process. If no specific tokenizer is available
for a given language, then the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html">OpenNLP
SimpleTokenizer</a> is used as default. How well this tokenizer works will
depend on the language.</p>
+<p><strong>POS tagger:</strong> POS (Part-of-Speech) taggers annotate tokens
with their type. Because the KeywordLinkingEngine is only interested in
Nouns, Foreign Words and Numbers, the presence of such a tagger allows it to skip
many of the tokens and thereby improve performance. However, POS taggers use
different sets of tags for different languages. It is therefore not
enough that a POS tagger is available for a language; there MUST also be a
configuration of the POS tags representing Nouns.</p>
+<p><strong>Chunker:</strong> There are two types of Chunkers. First the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>
as provided by OpenNLP (based on statistical models) and second a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java">POS
tag based Chunker</a> provided by the OpenNLP bundle of Stanbol. Currently the
availability of a Chunker has little influence on either the performance or
the quality of the Enhancements.</p>
+<ul>
+<li><strong>Configuration:</strong> The set of languages to be annotated can
be configured for the KeywordLinkingEngine. An empty configuration indicates
that texts in any language should be processed. By using this configuration it
is possible to configure different KeywordLinkingEngine instances for different
languages (e.g. with different configurations).</li>
+</ul>
+<h2 id="keyword_extraction_and_linking_workflow">Keyword extraction and
linking workflow</h2>
+<p>Basically the text is parsed from the beginning to the end and words are
looked up in the configured controlled vocabulary.</p>
+<h3 id="text_processing">Text Processing</h3>
+<p>The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java">AnalysedContent</a>
Interface is used to access natural language text that was already processed
by an NLP framework. Currently there is only a single implementation based on
the commons.opennlp <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java">TextAnalyzer</a>
utility. In general this part is still very focused on OpenNLP. Making it also
usable together with other NLP frameworks would probably need some
re-factoring.</p>
+<p>The current state of the processing is represented by the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/ProcessingState.java">ProcessingState</a>.
Based on the capabilities of the NLP framework for the current language, it
provides the following information:</p>
+<ul>
+<li><strong>AnalysedSentence:</strong> If a sentence detector is present, then
this represents the current sentence of the text. If not, then the whole text is
represented as a single sentence. The AnalysedSentence also provides access to
POS tags and Chunks (if available).</li>
+<li><strong>Chunk:</strong> If a chunker is present, then this represents the
current chunk. Otherwise this will be null. </li>
+<li><strong>Token:</strong> The word currently being processed, which is part of
the current chunk and sentence.</li>
+<li><strong>TokenIndex:</strong> The index of the currently active token
relative to the AnalysedSentence.</li>
+</ul>
+<p>The ProcessingState provides means to navigate to the next token. If chunks
are present, tokens outside of chunks are ignored.</p>
+<h3 id="entity_lookup">Entity Lookup</h3>
+<p>An "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to look up entities
via the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
interface. If the actual implementation cuts off results, then it must be
ensured that Entities that match all of the tokens are ranked first.
+Currently there are two implementations of this interface: (1) for the
Entityhub (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java">EntityhubSearcher</a>)
and (2) for ReferencedSites (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java">ReferencedSiteSearcher</a>).
There is also an <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java">Implementation</a>
that holds entities in-memory, however currently this is only used for unit
tests.</p>
+<p>Queries use the configured <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
and the language of labels is restricted to the current language or labels
that do not define any language.</p>
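+<p>The language restriction can be illustrated with a minimal sketch. The
<code>Label</code> class and the <code>usableLabels</code> method are
hypothetical names chosen for this example and are not part of the Stanbol API;
the case-insensitive comparison of language tags is also an assumption.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; these types are not part of Stanbol. */
final class LabelLanguageSketch {

    /** A label with an optional language tag; language is null when none is defined. */
    static final class Label {
        final String text;
        final String language;

        Label(String text, String language) {
            this.text = text;
            this.language = language;
        }
    }

    /** Keeps labels in the content language and labels without any language. */
    static List<Label> usableLabels(List<Label> labels, String contentLanguage) {
        List<Label> result = new ArrayList<>();
        for (Label label : labels) {
            // Assumption: language tags are compared case-insensitively.
            if (label.language == null || label.language.equalsIgnoreCase(contentLanguage)) {
                result.add(label);
            }
        }
        return result;
    }
}
```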
+<p>Only "processable" tokens are used to look up entities. Whether a token is
processable is determined as follows:</p>
+<ul>
+<li>If POS tags are available the "Boolean processPOS(String posTag)" method
of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java">AnalysedContent</a>
is used to check if a Token needs to be processed.</li>
+<li>If this method returns NULL or no POS tags are available, then all Tokens
longer than <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinSearchTokenLength()
(default=3) are considered as processable.</li>
+</ul>
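+<p>The two rules above can be sketched as follows. This is a simplified
illustration, not the engine's code: <code>isProcessable</code> and its
parameters are hypothetical, and <code>posDecision</code> stands in for the
result of the POS-based check (null when no POS tags are available or the check
is undecided). Following the wording above, "longer than" is read as a strict
comparison.</p>

```java
/** Illustrative sketch only; this class is not part of Stanbol. */
final class ProcessableTokenSketch {

    /**
     * @param token                the token text
     * @param posDecision          result of the POS tag check, or null if undecided
     *                             or if no POS tags are available
     * @param minSearchTokenLength minimum length used as fallback (default 3 in the text)
     */
    static boolean isProcessable(String token, Boolean posDecision, int minSearchTokenLength) {
        if (posDecision != null) {
            // The POS tag gives a definite answer.
            return posDecision;
        }
        // Fallback: only tokens longer than the configured minimum are processed.
        return token.length() > minSearchTokenLength;
    }
}
```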
+<p>Typically the next MAX_SEARCH_TOKENS processable tokens are used for a
lookup. However, the search for processable tokens never leaves the current
Chunk/Sentence.</p>
+<h3 id="matching_of_found_entities">Matching of found Entities</h3>
+<p>All labels (values of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
field) in the language of the content or without any defined language are
candidates for matches.</p>
+<p>For each label that fulfills the above criteria, the following steps are
performed. The best result is used as the result of the whole matching
process:</p>
+<ul>
+<li>All tokens (of the text) following the current position are searched
within the label.</li>
+<li>As of now, tokens MUST appear in the correct order within a label (e.g.
"Murdoch Rupert" will NOT match "Rupert Murdoch")</li>
+<li>Matching is canceled at the first processable token of the text that is not
present within the label (see the definition of processable token above).</li>
+<li>Matching is also canceled at the second non-processable token not found in
the label (e.g. "University of Michigan" will match "University
Michigan").</li>
+</ul>
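+<p>A minimal sketch of these matching steps for a single label is shown below.
It is an illustration under stated assumptions, not the EntityLinker
implementation: the class and method names are hypothetical, tokens are compared
case-insensitively, and the processable flag of each text token is supplied by
the caller.</p>

```java
import java.util.List;

/** Illustrative sketch only; this class is not part of Stanbol. */
final class LabelMatchSketch {

    /** Returns the number of text tokens found, in order, within the label. */
    static int countMatchedTokens(List<String> textTokens, List<Boolean> processable,
                                  List<String> labelTokens) {
        int matched = 0;
        int nextLabelIndex = 0;        // label tokens before this index are already consumed
        int missedNonProcessable = 0;  // non-processable text tokens missing from the label
        for (int i = 0; i < textTokens.size(); i++) {
            int found = indexOf(labelTokens, textTokens.get(i), nextLabelIndex);
            if (found >= 0) {
                matched++;
                nextLabelIndex = found + 1; // enforces the token order within the label
            } else if (processable.get(i)) {
                break; // first processable token that is not in the label
            } else if (++missedNonProcessable == 2) {
                break; // second non-processable token that is not in the label
            }
        }
        return matched;
    }

    /** Searches the label for a token, starting at the given index. */
    private static int indexOf(List<String> labelTokens, String token, int from) {
        for (int i = from; i < labelTokens.size(); i++) {
            // Assumption: tokens are compared case-insensitively.
            if (labelTokens.get(i).equalsIgnoreCase(token)) {
                return i;
            }
        }
        return -1;
    }
}
```

+<p>With this sketch, the text "University of Michigan" yields two matched
tokens for the label "University Michigan", while the text "Murdoch Rupert"
yields only one for the label "Rupert Murdoch" because of the ordering
constraint.</p>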
+<p>Entities are <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggested</a>
if:</p>
+<ul>
+<li>a label matches exactly with the text following the current position (e.g. <a
href="http://en.wikipedia.org/wiki/Passerine">Passerine</a>)</li>
+<li>a label matches at least <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
(default=2) tokens of the text. This ensures that "<a
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on
the other hand "Barack Hussein Obama" is suggested for "Barack Obama". Setting
"minFoundToken" to values less than two will usually cause a lot of false
positives, but would also come up with a suggestion for "Barack Obama" if the
content contains the word "Obama".</li>
+</ul>
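+<p>The two conditions can be combined into a single check, sketched below. The
method and parameter names are hypothetical, and an "exact" match is interpreted
here as every label token matching with no additional text token in between;
that interpretation is an assumption of this example.</p>

```java
/** Illustrative sketch only; this class is not part of Stanbol. */
final class SuggestionSketch {

    /**
     * @param matched        number of text tokens found in the label
     * @param span           number of text tokens covered by the match
     * @param labelTokens    number of tokens in the label
     * @param minFoundTokens configured minimum of matching tokens (default 2 in the text)
     */
    static boolean isSuggested(int matched, int span, int labelTokens, int minFoundTokens) {
        // Assumption: exact means all label tokens matched and nothing was skipped in the text.
        boolean exact = matched == labelTokens && matched == span;
        return exact || matched >= minFoundTokens;
    }
}
```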
+<p>The described matching process is currently implemented directly in the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
To support different matching strategies, it would need to be externalized
into a separate "EntityLabelMatcher" interface.</p>
+<h3 id="processing_of_entity_suggestions">Processing of Entity Suggestions</h3>
+<p>In case there are one or more <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggestion</a>s
of Entities for the current position within the text a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/LinkedEntity.java">LinkedEntity</a>
instance is created.</p>
+<p>LinkedEntity is an object model representing the Stanbol Enhancement
Structure. After the processing of the parsed content is completed, the
LinkedEntities are "serialized" as RDF triples to the metadata of the
ContentItem.</p>
+<p>TextAnnotations as defined in the <a
href="http://wiki.iks-project.eu/index.php/EnhancementStructure">Stanbol
Enhancement Structure</a> use the <a
href="http://www.dublincore.org/documents/dcmi-terms/#terms-type">dc:type</a>
property to provide the general type of the extracted Entity. However suggested
Entities might have very specific types. Therefore the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>
provides the possibility to map the specific types of the Entity to types used
for the dc:type property of TextAnnotations. The <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.DEFAULT_ENTITY_TYPE_MAPPINGS
contains some predefined mappings.
+<em>Note that the field used to retrieve the types of a suggested Entity can
be configured by the EntityLinkerConfig. The default value for the type field
is "rdf:type".</em></p>
+<p>In some cases suggested entities might redirect to others. In the case of
Wikipedia/DBpedia this is often used to link from acronyms like <a
href="http://en.wikipedia.org/w/index.php?title=IMF&redirect=no">IMF</a> to
the real entity <a
href="http://en.wikipedia.org/wiki/International_Monetary_Fund">International
Monetary Fund</a>. Some thesauri also define labels as separate Entities with
their own URI, and users might want to use the URI of the Concept rather than
that of the label.
+To support such use cases the KeywordLinkingEngine has support for redirects.
Users can first configure the redirect mode (ignore, copy values, follow) and
secondly the field used to search for redirects (default=rdfs:seeAlso).
+If the redirect mode is not "ignore", the Entities referenced by the configured
redirect field are retrieved for each suggestion. In "copy values" mode,
the values of the name and type fields are copied. In "follow" mode,
the suggested entity is replaced with the first redirected entity.</p>
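+<p>The three redirect modes can be sketched as follows. The
<code>Entity</code> class, the <code>RedirectMode</code> enum and the
<code>applyRedirects</code> method are hypothetical stand-ins, not Stanbol
types. The sketch also assumes that "copy values" adds the name and type values
of the redirect targets to the suggested entity rather than replacing them.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; these types are not part of Stanbol. */
final class RedirectSketch {

    enum RedirectMode { IGNORE, COPY_VALUES, FOLLOW }

    /** Minimal stand-in for an entity: a URI plus the values of its name and type fields. */
    static final class Entity {
        final String uri;
        final List<String> names = new ArrayList<>();
        final List<String> types = new ArrayList<>();

        Entity(String uri) {
            this.uri = uri;
        }
    }

    /** Returns the entity to suggest after applying the configured redirect mode. */
    static Entity applyRedirects(Entity suggested, List<Entity> redirectTargets, RedirectMode mode) {
        if (mode == RedirectMode.IGNORE || redirectTargets.isEmpty()) {
            return suggested;
        }
        if (mode == RedirectMode.FOLLOW) {
            // The suggestion is replaced with the first redirected entity.
            return redirectTargets.get(0);
        }
        // COPY_VALUES (assumption: values are added to the suggested entity).
        for (Entity target : redirectTargets) {
            suggested.names.addAll(target.names);
            suggested.types.addAll(target.types);
        }
        return suggested;
    }
}
```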
+<h3 id="confidence_for_suggestions">Confidence for Suggestions</h3>
+<p>The confidence for suggestions is calculated based on the following
algorithm:</p>
+<p>Input Parameters</p>
+<ul>
+<li>max_matched: maximum number of the matched tokens of all suggestions e.g.
the text contains "Barack Obama" -> 2</li>
+<li>matched: number of tokens that match for the current suggestion e.g.
"Barack Hussein Obama" -> 2</li>
+<li>span: number of tokens selected by the current suggestion e.g. "Barack
Hussein Obama" -> 2</li>
+<li>label_tokens: number of tokens of the matched label of the current entity
(label_token) e.g. "Barack Hussein Obama" -> 3</li>
+</ul>
+<p>confidence = (matched/max_matched)^2 * (matched/span) *
(matched/label_tokens)</p>
+<p>Some Examples:</p>
+<ul>
+<li>"Barack Hussein Obama" matched against the text "Barack Obama" results in
a confidence of (2/2)^2 * (2/2) * (2/3) = 0.67</li>
+<li>"University Michigan" matched against the text "University of Michigan"
results in a confidence of (2/2)^2 * (2/3) * (2/2) = 0.67</li>
+<li>"New York City" matched against the text "New York Rangers" - assuming
that "New York Rangers" is the best match - results in a confidence of (2/3)^2
* (2/2) * (2/3) = 0.30. Note that the best match "New York Rangers" has
max_matched=3 and gets a confidence of 1.</li>
+</ul>
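+<p>The formula and the examples above can be reproduced with a few lines of
Java. The class and method names are hypothetical and chosen for this sketch
only; the parameters correspond to the input parameters listed above.</p>

```java
/** Illustrative sketch of the confidence formula; this class is not part of Stanbol. */
final class ConfidenceSketch {

    /**
     * @param maxMatched  highest number of matched tokens over all suggestions
     * @param matched     tokens matched by the current suggestion
     * @param span        text tokens selected by the current suggestion
     * @param labelTokens tokens in the matched label of the current entity
     */
    static double confidence(int maxMatched, int matched, int span, int labelTokens) {
        double relative = (double) matched / maxMatched;
        return relative * relative
                * ((double) matched / span)
                * ((double) matched / labelTokens);
    }

    public static void main(String[] args) {
        // "Barack Hussein Obama" against the text "Barack Obama"
        System.out.println(confidence(2, 2, 2, 3));
        // "University Michigan" against the text "University of Michigan"
        System.out.println(confidence(2, 2, 3, 2));
        // "New York City" against the text "New York Rangers" (best match has 3 tokens)
        System.out.println(confidence(3, 2, 2, 3));
    }
}
```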
+<p>The calculation of the confidence is currently implemented directly in the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
To support different matching strategies, this would need to be externalized
into a separate interface.</p>
+<h2 id="future_plans_for_the_taxonomylinkingengine">Future Plans for the
TaxonomyLinkingEngine</h2>
+<p>The TaxonomyLinkingEngine is still available and fully functional. However
it is marked as deprecated and not included in any of the launchers. Current
users are encouraged to switch over to the KeywordLinkingEngine. </p>
+<p>In the future it is planned to repurpose the TaxonomyLinkingEngine as a
special version of the KeywordLinkingEngine with a specialized configuration
and feature set targeted at (hierarchical) Taxonomies. </p>
+<p>This will include: </p>
+<ul>
+<li>default configuration specific for SKOS</li>
+<li>support for term hierarchies - adding suggestions for parent concepts</li>
+<li>support for restricting enhancements to a specific Taxonomy
(skos:ConceptScheme) - this would allow indexing several taxonomies in the same
ReferencedSite while still using only a specific one for the enhancements.</li>
+</ul>
+ </div>
+
+ <div id="footer">
+ <div class="copyright">
+ <p>
+ Copyright © 2010 The Apache Software Foundation, Licensed under
+ the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache
License, Version 2.0</a>.
+ <br />
+ Apache, Stanbol and the Apache feather and Stanbol logos are
trademarks of The Apache Software Foundation.
+ </p>
+ </div>
+ </div>
+
+</body>
+</html>