customvocabulary.html

agruber Fri, 23 Sep 2011 01:26:38 -0700

Author: agruber
Date: Fri Sep 23 08:26:12 2011
New Revision: 796155

Log:
updated custom vocabulary case description


Modified:
    websites/production/stanbol/   (props changed)
    websites/production/stanbol/content/stanbol/docs/trunk/customvocabulary.html

Propchange: websites/production/stanbol/
------------------------------------------------------------------------------
--- svn:mergeinfo (original)
+++ svn:mergeinfo Fri Sep 23 08:26:12 2011
@@ -1 +1 @@
-/websites/staging/stanbol/trunk:779452-796113
+/websites/staging/stanbol/trunk:779452-796154

Modified: 
websites/production/stanbol/content/stanbol/docs/trunk/customvocabulary.html
==============================================================================
--- 
websites/production/stanbol/content/stanbol/docs/trunk/customvocabulary.html 
(original)
+++ 
websites/production/stanbol/content/stanbol/docs/trunk/customvocabulary.html 
Fri Sep 23 08:26:12 2011
@@ -49,7 +49,7 @@
     <p>The ability to work with custom vocabularies is necessary for many 
organisations. Use cases range from being able to detect various types of named 
entities specific of a company or to detect and work with concepts from a 
specific domain.</p>
 <p>For text enhancement and linking to external sources, the Entityhub 
component of Apache Stanbol allows to work with local indexes of datasets for 
several reasons: </p>
 <ul>
-<li>do not want to rely on internet connectivity to these services, thus 
working offline with a huge set of enties</li>
+<li>do not want to rely on internet connectivity to these services, thus 
working offline with a huge set of entities</li>
 <li>want to manage local updates of these public repositories and </li>
 <li>want to work with local resources only, such as your LDAP directory or a 
specific and private enterprise vocabulary of a specific domain.</li>
 </ul>
@@ -60,7 +60,7 @@
 <h3 id="a_create_your_own_index">A. Create your own index</h3>
 <p><strong>Step 1 : Create the indexing tool</strong></p>
 <p>The indexing tool provides a default configuration for creating a SOLr 
index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf 
files).</p>
-<p>If not yet built during the Stanbol build process of the entityhub call</p>
+<p>If not yet built during the Stanbol build process of the Entityhub call</p>
 <div class="codehilite"><pre><span class="n">mvn</span> <span 
class="n">install</span>
 </pre></div>
 
@@ -84,7 +84,7 @@
 
 <p>You will get a directory with the default configuration files, one for the 
sources and a distribution directory for the resulting files. Make sure, that 
you adapt the default configuration with at least </p>
 <ul>
-<li>the id/name and licence information of your data and </li>
+<li>the id/name and license information of your data and </li>
 <li>namespaces and properties mapping you want to include to the index (see 
example of a <a href="examples/anl-mappings.txt">mappings.txt</a> including 
default and specific mappings for one dataset)</li>
 </ul>
 <p>Then, copy your source files into the respective directory 
<code>indexing/resources/rdfdata</code>. Several standard formats for RDF, 
multiple files and archives of them are supported. </p>
@@ -95,42 +95,43 @@
 
 
 <p>Depending on your hardware and on complexity and size of your sources, it 
may take several hours to built the index. As a result, you will get an archive 
of a <a href="http://lucene.apache.org/solr/">SOLr</a> index together with an 
OSGI bundle to work with the index in Stanbol.</p>
-<p><strong>Step 3 : Initialise the index within Stanbol</strong></p>
+<p><strong>Step 3 : Initialize the index within Stanbol</strong></p>
 <p>At your running Stanbol instance, copy the ZIP archive into 
<code>{root}/sling/datafiles</code>. Then, at the "Bundles" tab of the 
administration console add and start the 
<code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.</p>
 <h3 id="b_configure_and_use_the_index_with_enhancement_engines">B. Configure 
and use the index with enhancement engines</h3>
-<p>Before you can make use of the custom vocabulary you need to decide, which 
kind of enhancements you want to support. If your enhancements are 
NamedEntities in its more strict sense (Persons, Locations, Organizations), 
then you can may use the standard NER engine together with its 
EntityLinkingEngine to configure the destination of your links.</p>
-<p>In such cases, where you want to match all kinds of named entities and 
concepts from your custom vocabulary, you should work with the 
TaxonomyLinkingEngine to both, find occurrences and to link them to custom 
entities. In this case, you'll get only results, if there is a match, while in 
the case above, you even get entities, where you don't find exact links. This 
approach will have its advantages when you need to have a high recall rate on 
your custom entities.</p>
+<p>Before you can make use of the custom vocabulary you need to decide, which 
kind of enhancements you want to support. If your enhancements are Named 
Entities in its strict sense (Persons, Locations, Organizations), then you may 
use the standard NER engine together with its EntityLinkingEngine to configure 
the destination of your links.</p>
+<p>In cases, where you want to match all kinds of named entities and concepts 
from your custom vocabulary, you should work with the <a 
href="enhancer/engines/keywordlinkingengine.html">KeywordLinkingEngine</a> to 
both, find occurrences and to link them to custom entities. In this case, 
you'll get only results, if there is a match, while in the case above, you even 
get entities, where you don't find exact links. This approach will have its 
advantages when you need to have a high recall rate on your custom entities.</p>
 <p>In the following the configuration options are described briefly.</p>
-<p><strong>Use the TaxonomyLinkingEngine only</strong></p>
-<p>(1) To make sure, that the enhancement process uses the TaxonomyEngine 
only, deactivate the "standard NLP" enhancement engines, especially the 
NamedEntityExtractionEnhancementEngine (NER) and the EntityLinkingEngine before 
to work with the TaxonomyLinkingEngine.</p>
-<p>(2) Open the configuration console at 
http://localhost:8080/system/console/configMgr and navigate to the 
TaxonomyLinkingEngine. Its main options are configurable via the UI.</p>
+<p><strong>Use the KeywordLinkingEngine only</strong></p>
+<p>(1) To make sure, that the enhancement process uses the 
KeywordLinkingEngine only, deactivate the "standard NLP" enhancement engines, 
especially the NamedEntityExtractionEnhancementEngine (NER) and the 
EntityLinkingEngine before to work with the TaxonomyLinkingEngine.</p>
+<p>(2) Open the configuration console at 
http://localhost:8080/system/console/configMgr and navigate to the 
KeywordLinkingEngine. Its main options are configurable via the UI.</p>
 <ul>
 <li>Referenced Site: {put the id/name of your index}</li>
-<li>Label Field: {the property to search for} </li>
-<li>Use Simple Tokenizer: {deactivate to use language specific tokenizers}</li>
+<li>Label Field: {the property to search for}</li>
+<li>Type Field: {types of matched entries} </li>
+<li>Redirect Field: {redirection links}</li>
+<li>Redirect Mode: {ignore, follow, add values}</li>
 <li>Min Token Length: {set minimal token length}</li>
-<li>Use Chunker: {disable/enable language specific chunkers}</li>
 <li>Suggestions: {maximum number of suggestions}</li>
-<li>Number of Required Tokens: {minimal required tokens}</li>
+<li>Languages: {languages to use}</li>
 </ul>
-<p><em>For further details please on the engine and its configuration please 
refer to the according README at 
<code>{root}/stanbol/enhancer/engines/taxonomylinking/</code>.</em> (TODO: 
create the Readme)</p>
-<p><strong>Use several instances of the TaxonomyLinkingEngine</strong></p>
-<p>To work at the same time with different instances of the 
TaxonomyLinkingEngine can be useful in cases, where you have two or more 
distinct custom vocabularies/indexes and/or if you want to combine your 
specific domain vocabulary with general purpose datasets such as dbpedia or 
others.</p>
-<p><strong>Use the TaxonomyLinkingEngine together with the NER engine and the 
EntityLinkingEngine</strong></p>
-<p>If your text corpus contains and you are interested in both, generic 
NamedEntities and custom thesaurus you may use (TODO)<br />
-</p>
-<h2 id="specific_examples">Specific Examples</h2>
+<p><em>Full details on the engine and its configuration are available <a 
href="enhancer/engines/keywordextraction.html">here</a>.</em></p>
+<p><strong>Use several instances of the KeywordLinkingEngine</strong></p>
+<p>To work at the same time with different instances of the 
KeywordLinkingEngine can be useful in cases, where you have two or more 
distinct custom vocabularies/indexes and/or if you want to combine your 
specific domain vocabulary with general purpose datasets such as dbpedia or 
others.</p>
+<p><strong>Use the KeywordLinkingEngine together with the NER engine and the 
EntityLinkingEngine</strong></p>
+<p>If your text corpus contains common entities and enterprise specific as 
well and you are interested getting enhancements for both, you may also use the 
KeywordLinkingEngine for your custom thesaurus and the NERengine together with 
the EntityLinkingEngine targeting at e.g. dbpedia at the same time. </p>
+<h2 id="examples">Examples</h2>
 <p>You can find guidance for the following indexers in the README files at 
<code>{root}/entityhub/indexing/{name-for-indexer}</code></p>
 <ul>
-<li><a href="http://dbpedia.org/">DBpedia</a> dataset (Wikipedia data)</li>
+<li><a href="http://dbpedia.org/">dbpedia</a> dataset (Wikipedia data)
+ For dbpedia, there is also a <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh">script</a>
 available, which helps in generating your own dbpedia index.</li>
 <li><a href="http://www.geonames.org";>geonames.org</a> dataset (geolocation 
data)</li>
 <li><a href="http://dblp.uni-trier.de/">DBLP</a> dataset (scientific 
bibliography data)</li>
 </ul>
-<h2 id="demos_and_ressources">Demos and Ressources</h2>
+<h2 id="demos_and_resources">Demos and Resources</h2>
 <ul>
 <li>The full <a href="http://dev.iks-project.eu:8081/">demo</a> installation 
of Stanbol is configured to also work with an environmental thesaurus - if you 
test it with unstructured text from the domain, you should get enhancements 
with additional results for specific "concepts".</li>
 <li>Download custom test indexes and installer bundles for Stanbol from <a 
href="http://dev.iks-project.eu/downloads/stanbol-indices/">here</a> (e.g. for 
GEMET environmental thesaurus, or a big dbpedia index).</li>
-<li>Another concrete example with metadata from the Austrian National Library 
is described (TODO: link) here.</li>
+<li>A very concrete example using metadata from the Austrian National Library 
is described <a 
href="http://blog.iks-project.eu/using-custom-vocabularies-with-apache-stanbol/">here</a>.</li>
 </ul>
   </div>

svn commit: r796155 - in /websites/production/stanbol: ./ content/stanbol/docs/trunk/customvocabulary.html
