Author: buildbot
Date: Fri Sep 16 09:59:27 2011
New Revision: 795878

Log:
Staging update by buildbot

Added:
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/examples/
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
Modified:
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html

Modified: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html (original)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html Fri Sep 16 09:59:27 2011
@@ -46,19 +46,21 @@
   
   <div id="content">
     <h1 class="title">Using custom/local vocabularies with Apache Stanbol</h1>
-    <p>For text enhancement and linking to external sources, the Entityhub 
provides you with the possibility to work with local indexes of datasets for 
several reasons. Firstly, you do not want to rely on internet connectivity to 
these services, secondly you may want to manage local changes to these public 
repository and thirdly, you may want to work with local resources only, such as 
your LDAP directory or a specific and private enterprise vocabulary of your 
domain.</p>
-<p>The main other possibility is to upload ontologies to the ontology manager 
and to use the reasoning components over it.</p>
-<p>This document focuses on two cases:</p>
+    <p>The ability to work with custom vocabularies is necessary for many organisations. Use cases range from detecting various types of named entities specific to a company to detecting and working with concepts from a specific domain.</p>
+<p>For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol allows you to work with local indexes of datasets. There are several reasons for doing so:</p>
 <ul>
-<li>Creating and using a local SOLr index of a given vocabulary e.g. a SKOS 
thesaurus or taxonomy of your domain</li>
-<li>Directly working with individual instance entities from given ontologies 
e.g. a FOAF repository.</li>
+<li>you do not want to rely on internet connectivity to these services and thus prefer to work offline with a huge set of entities,</li>
+<li>you want to manage local updates of these public repositories, and</li>
+<li>you want to work with local resources only, such as your LDAP directory or a private enterprise vocabulary of your domain.</li>
 </ul>
-<h2 id="creating_and_working_with_local_indexes">Creating and working with 
local indexes</h2>
-<p>The ability to work with custom vocabularies in Stanbol is necessary for 
many organizational use cases such as beeing able to detect various types of 
named entities specific to a company or to detect and work with concepts from a 
specific domain. Stanbol provides the machinery to start with vocabularies in 
standard languages such as <a href="http://www.w3.org/2004/02/skos/";>SKOS - 
Simple Knowledge Organization Systems</a> or more general <a 
href="http://www.w3.org/TR/rdf-primer/";>RDF</a> encoded data sets. The 
respective Stanbol components, which are needed for this functionality are the 
Entityhub for creating and managing the index and several <a 
href="engines.html">Enhancement Engines</a> to make use of the index during the 
enhancement process.</p>
-<h3 id="create_your_own_index">Create your own index</h3>
+<p>Creating your own custom indexes is the preferred way of working with custom vocabularies. For small vocabularies, one can also upload simple ontologies together with their instance data directly to the Entityhub and manage them there - but this approach has a major downside: one can only manage a single ontology per installation.</p>
+<p>This document focuses on the main case: creating and using a local SOLr index of a custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.</p>
+<h2 id="creating_and_working_with_custom_local_indexes">Creating and working 
with custom local indexes</h2>
+<p>Stanbol provides the machinery to start with vocabularies in standard languages such as <a href="http://www.w3.org/2004/02/skos/">SKOS - Simple Knowledge Organization Systems</a> or, more generally, <a href="http://www.w3.org/TR/rdf-primer/">RDF</a> encoded data sets. The Stanbol components needed for this functionality are the Entityhub, for creating and managing the index, and several <a href="engines.html">Enhancement Engines</a>, to make use of the indexes during the enhancement process.</p>
+<h3 id="a_create_your_own_index">A. Create your own index</h3>
 <p><strong>Step 1 : Create the indexing tool</strong></p>
 <p>The indexing tool provides a default configuration for creating a SOLr 
index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf 
files).</p>
-<p>(1) If not yet built during the Stanbol build process of the entityhub 
call</p>
+<p>If the indexing tool was not yet built during the Stanbol build process, build it in the entityhub module by calling</p>
 <div class="codehilite"><pre>mvn install
 </pre></div>
 
@@ -80,7 +82,13 @@
 </pre></div>
 
 
-<p>You will get a directory with the default configuration files, one for the 
sources and a distribution directory for the resulting files. Make sure, that 
you adapt the default configuration with at least the name of your index and 
namespaces and properties you need to include to the index and copy your source 
files into the respective directory <code>indexing/resources/rdfdata</code>. 
Several standard formats for RDF, multiple files and archives of them are 
supported. <em>For details of possible configurations, please consult the 
<code>{root}/entityhub/indexing/genericrdf/readme.md</code>.</em></p>
+<p>You will get a directory with the default configuration files, a directory for the sources, and a distribution directory for the resulting files. Make sure that you adapt the default configuration with at least:</p>
+<ul>
+<li>the id/name and licence information of your data, and</li>
+<li>the namespace and property mappings you want to include in the index (see the example <a href="examples/anl-mappings.txt">mappings.txt</a> with default and dataset-specific mappings for one dataset; a short excerpt is shown below)</li>
+</ul>
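+<p>As a brief illustration, here are a few typical lines from such a <code>mappings.txt</code> (taken from the example file linked above):</p>
+<div class="codehilite"><pre># index rdf:type values as references to other entities
+rdf:type | d=entityhub:ref
+# index all SKOS properties and copy the preferred label over to rdfs:label
+skos:*
+skos:prefLabel &gt; rdfs:label
+</pre></div>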
+<p>Then, copy your source files into the respective directory <code>indexing/resources/rdfdata</code>. Several standard RDF formats are supported, as are multiple files and archives of them.</p>
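+<p>As a minimal sketch of this step, assuming a hypothetical SKOS export named <code>my-thesaurus.rdf</code> in the current directory:</p>
+<div class="codehilite"><pre>cp my-thesaurus.rdf indexing/resources/rdfdata/
+</pre></div>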
+<p><em>For more details on possible configurations, please consult the README at <code>{root}/entityhub/indexing/genericrdf/</code>.</em></p>
 <p>Then, you can start the index by running</p>
 <div class="codehilite"><pre>java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
 </pre></div>
@@ -89,7 +97,7 @@
 <p>Depending on your hardware and on the complexity and size of your sources, it may take several hours to build the index. As a result, you will get an archive of a <a href="http://lucene.apache.org/solr/">SOLr</a> index together with an OSGi bundle to work with the index in Stanbol.</p>
 <p><strong>Step 3 : Initialise the index within Stanbol</strong></p>
 <p>At your running Stanbol instance, copy the ZIP archive into <code>{root}/sling/datafiles</code>. Then, at the "Bundles" tab of the administration console, add and start the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.</p>
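 <p>As a minimal sketch (the archive name is illustrative; the indexing tool writes its results to the distribution directory, e.g. <code>indexing/dist</code>):</p>
 <div class="codehilite"><pre>cp indexing/dist/myThesaurus.solrindex.zip {root}/sling/datafiles/
 </pre></div>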
-<h3 id="configuring_the_enhancement_engines">Configuring the enhancement 
engines</h3>
+<h3 id="b_configure_and_use_the_index_with_enhancement_engines">B. Configure 
and use the index with enhancement engines</h3>
 <p>Before you can make use of the custom vocabulary, you need to decide which kind of enhancements you want to support. If your enhancements are NamedEntities in the stricter sense (Persons, Locations, Organizations), then you may use the standard NER engine together with the EntityLinkingEngine to configure the destination of your links.</p>
 <p>In cases where you want to match all kinds of named entities and concepts from your custom vocabulary, you should work with the TaxonomyLinkingEngine to both find occurrences and link them to custom entities. With this engine you only get results if there is a match, while in the case above you even get entities for which no exact link is found. This approach has its advantages when you need a high recall rate on your custom entities.</p>
 <p>In the following, the configuration options are described briefly.</p>
@@ -97,32 +105,28 @@
 <p>(1) To make sure that the enhancement process uses the TaxonomyLinkingEngine only, deactivate the "standard NLP" enhancement engines, especially the NamedEntityExtractionEnhancementEngine (NER) and the EntityLinkingEngine, before working with the TaxonomyLinkingEngine.</p>
 <p>(2) Open the configuration console at 
http://localhost:8080/system/console/configMgr and navigate to the 
TaxonomyLinkingEngine. Its main options are configurable via the UI.</p>
 <ul>
-<li>Referenced Site: {put the id/name of your index} (required)</li>
-<li>Label Field: {the property to search for}</li>
+<li>Referenced Site: {put the id/name of your index}</li>
+<li>Label Field: {the property to search for} </li>
 <li>Use Simple Tokenizer: {deactivate to use language specific tokenizers}</li>
 <li>Min Token Length: {set minimal token length}</li>
 <li>Use Chunker: {disable/enable language specific chunkers}</li>
 <li>Suggestions: {maximum number of suggestions}</li>
 <li>Number of Required Tokens: {minimal required tokens}</li>
 </ul>
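 <p>For example, for a SKOS thesaurus index the fields might be filled in as follows (all values are purely illustrative and depend on your index and content):</p>
 <div class="codehilite"><pre>Referenced Site: myThesaurus
 Label Field: skos:prefLabel
 Use Simple Tokenizer: disabled
 Min Token Length: 3
 Use Chunker: enabled
 Suggestions: 3
 Number of Required Tokens: 1
 </pre></div>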
-<p><em>For further details please on the engine and its configuration please 
consult the according Readme file at TODO: create the readme 
<code>{root}/stanbol/enhancer/engines/taxonomylinking/<code>.</em></p>
+<p><em>For further details on the engine and its configuration, please refer to the README at <code>{root}/stanbol/enhancer/engines/taxonomylinking/</code>.</em> (TODO: create the Readme)</p>
 <p><strong>Use several instances of the TaxonomyLinkingEngine</strong></p>
 <p>Working with several instances of the TaxonomyLinkingEngine at the same time can be useful in cases where you have two or more distinct custom vocabularies/indexes, and/or where you want to combine your specific domain vocabulary with general-purpose datasets such as dbpedia.</p>
 <p><strong>Use the TaxonomyLinkingEngine together with the NER engine and the 
EntityLinkingEngine</strong></p>
-<p>If your text corpus contains and you are interested in both, generic 
NamedEntities and custom thesaurus you may use <br />
+<p>If your text corpus contains, and you are interested in, both generic NamedEntities and concepts from a custom thesaurus, you may use (TODO)<br />
+</p>
-<h3 id="demos_and_examples">Demos and Examples</h3>
+<h2 id="specific_examples">Specific Examples</h2>
+<p><strong>Create your custom index for dbpedia:</strong> (TODO: dbpedia 
indexing (&lt;-- olivier))</p>
+<h2 id="resources">Resources</h2>
 <ul>
-<li>The full demo installation of Stanbol is configured to also work with an 
environmental thesaurus - if you test it with unstructured text from the 
domain, you should get enhancements with additional results for specific 
"concepts".</li>
-<li>One example can be found with metadata from the Austrian National Library 
is described (TODO: link) here.</li>
+<li>The full <a href="http://dev.iks-project.eu:8081/">demo</a> installation of Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from the domain, you should get enhancements with additional results for specific "concepts".</li>
+<li>Download custom test indexes and installer bundles for Stanbol from <a href="http://dev.iks-project.eu/downloads/stanbol-indices/">here</a> (e.g. for the GEMET environmental thesaurus, or a big dbpedia index).</li>
+<li>Another concrete example, with metadata from the Austrian National Library, is described here (TODO: link).</li>
 </ul>
-<p>(TODO) - Examples</p>
-<h2 id="create_a_custom_index_for_dbpedia">Create a custom index for 
dbpedia</h2>
-<p>(TODO) dbpedia indexing (&lt;-- olivier)</p>
-<h2 id="working_with_ontologies_in_entityhub">Working with ontologies in 
EntityHub</h2>
-<p>(TODO)</p>
-<h3 id="demos_and_examples_1">Demos and Examples</h3>
-<p>(TODO)</p>
   </div>
   
   <div id="footer">

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt Fri Sep 16 09:59:27 2011
@@ -0,0 +1,164 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+#NOTE: THIS IS A DEFAULT MAPPING SPECIFICATION THAT INCLUDES MAPPINGS FOR
+#      COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION BY
+#      COMMENTING/UNCOMMENTING AND/OR ADDING NEW MAPPINGS
+
+# --- Define the Languages for all fields ---
+# to restrict languages to be imported (for all fields)
+#| @=null;en;de;fr;it
+
+#NOTE: null is used to import labels with no specified language
+
+# to import all languages leave this empty
+
+# --- RDF, RDFS and OWL Mappings ---
+# This configuration only indexes properties that are typically used to store
+# instance data defined by such namespaces. This excludes ontology definitions.
+
+# NOTE that nearly all other ontologies are using properties of these three
+#      schemas, therefore it is strongly recommended to include such information!
+
+rdf:type | d=entityhub:ref
+
+rdfs:label 
+rdfs:comment
+rdfs:seeAlso | d=entityhub:ref
+
+
+owl:sameAs | d=entityhub:ref
+
+# If you also want to index ontologies, add the following statements:
+#owl:*
+#rdfs:*
+
+# --- Dublin Core (DC) ---
+# The default configuration imports all dc-terms data and copies values from the
+# old dc-elements standard over to the corresponding properties of the dc-terms
+# standard.
+
+# NOTE that a lot of other ontologies also use DC for some of their data,
+#      therefore it is strongly recommended to include such information!
+
+#mapping for all dc-terms properties
+dc:*
+
+# copy dc:title to rdfs:label
+dc:title > rdfs:label
+
+# deactivated by default, because the dc-elements properties are mapped to dc-terms
+#dc-elements:*
+
+# mappings for the dc-elements properties to the dc-terms
+dc-elements:contributor > dc:contributor
+dc-elements:coverage > dc:coverage
+dc-elements:creator > dc:creator
+dc-elements:date > dc:date
+dc-elements:description > dc:description
+dc-elements:format > dc:format
+dc-elements:identifier > dc:identifier
+dc-elements:language > dc:language
+dc-elements:publisher > dc:publisher
+dc-elements:relation > dc:relation
+dc-elements:rights > dc:rights
+dc-elements:source > dc:source
+dc-elements:subject > dc:subject
+dc-elements:title > dc:title
+dc-elements:type > dc:type
+# also use dc-elements:title as label
+dc-elements:title > rdfs:label
+
+# --- Social Networks (via foaf) ---
+# The Friend of a Friend schema, often used to describe social relations between people
+foaf:*
+
+# copy the name of a person over to rdfs:label
+foaf:name > rdfs:label
+
+# additional data types checks
+foaf:knows | d=entityhub:ref
+foaf:made | d=entityhub:ref
+foaf:maker | d=entityhub:ref
+foaf:member | d=entityhub:ref
+foaf:homepage | d=xsd:anyURI
+foaf:depiction | d=xsd:anyURI
+foaf:img | d=xsd:anyURI
+foaf:logo | d=xsd:anyURI
+#page about the entity
+foaf:page | d=xsd:anyURI
+
+
+# --- Simple Knowledge Organization System (SKOS) ---
+
+# A common data model for sharing and linking knowledge organization systems
+# via the Semantic Web. Typically used to encode controlled vocabularies such as
+# a thesaurus.
+skos:*
+
+# copy the preferred label over to rdfs:label
+skos:prefLabel > rdfs:label
+
+# copy values of the *Match relations to the corresponding related, broader
+# and narrower properties
+skos:relatedMatch > skos:related
+skos:broadMatch > skos:broader
+skos:narrowMatch > skos:narrower
+
+# similar mappings for the transitive variants are not included, because
+# transitive reasoning is not directly supported by the Entityhub.
+
+# Some SKOS thesauri do use "skos:broaderTransitive" and "skos:narrowerTransitive";
+# however, such properties are only intended to be used by reasoners to
+# calculate transitive closures over broader/narrower hierarchies.
+# See http://www.w3.org/TR/skos-reference/#L2413 for details.
+# To correct such cases we copy the transitive relations to their counterparts:
+skos:narrowerTransitive > skos:narrower
+skos:broaderTransitive > skos:broader
+
+
+# --- Semantically-Interlinked Online Communities (SIOC) ---
+
+# an ontology for describing the information in online communities. 
+# This information can be used to export information from online communities 
+# and to link them together. The scope of the application areas that SIOC can 
+# be used for includes (and is not limited to) weblogs, message boards, 
+# mailing lists and chat channels.
+sioc:*
+
+# --- biographical information (bio)
+# A vocabulary for describing biographical information about people, both 
living
+# and dead. (see http://vocab.org/bio/0.1/)
+bio:*
+
+# --- Rich Site Summary (rss) ---
+rss:*
+
+# --- GoodRelations (gr) ---
+# GoodRelations is a standardised vocabulary for product, price, and company 
data
+gr:*
+
+# --- Creative Commons Rights Expression Language (cc)
+# The Creative Commons Rights Expression Language (CC REL) lets you describe 
+# copyright licenses in RDF.
+cc:*
+
+# --- Additional namespaces added for the Europeana dataset 
(http://ckan.net/dataset/europeana-lod) ---
+http://www.europeana.eu/schemas/edm/*
+http://www.openarchives.org/ore/terms/*
+
+
+
+
+

