Author: agruber
Date: Fri Sep 16 09:59:13 2011
New Revision: 1171482
URL: http://svn.apache.org/viewvc?rev=1171482&view=rev
Log:
updated customvocabulary description with examples
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext?rev=1171482&r1=1171481&r2=1171482&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext
Fri Sep 16 09:59:13 2011
@@ -1,25 +1,28 @@
Title: Using custom/local vocabularies with Apache Stanbol
-For text enhancement and linking to external sources, the Entityhub provides
you with the possibility to work with local indexes of datasets for several
reasons. Firstly, you do not want to rely on internet connectivity to these
services, secondly you may want to manage local changes to these public
repository and thirdly, you may want to work with local resources only, such as
your LDAP directory or a specific and private enterprise vocabulary of your
domain.
+The ability to work with custom vocabularies is necessary for many
organisations. Use cases range from detecting various types of named
entities specific to a company to detecting and working with concepts from a
specific domain.
-The main other possibility is to upload ontologies to the ontology manager and
to use the reasoning components over it.
+For text enhancement and linking to external sources, the Entityhub component
of Apache Stanbol allows you to work with local indexes of datasets for several
reasons:
-This document focuses on two cases:
+- you do not want to rely on internet connectivity to these services and want
to work offline with a huge set of entities,
+- you want to manage local updates of these public repositories, and
+- you want to work with local resources only, such as your LDAP directory or a
specific and private enterprise vocabulary of your domain.
-- Creating and using a local SOLr index of a given vocabulary e.g. a SKOS
thesaurus or taxonomy of your domain
-- Directly working with individual instance entities from given ontologies
e.g. a FOAF repository.
+Creating your own custom indexes is the preferred way of working with custom
vocabularies. For small vocabularies, one can also upload simple ontologies
together with their instance data directly to the Entityhub and manage them
there - but as a major downside of this approach, one can only manage one
ontology per installation.
-## Creating and working with local indexes
+This document focuses on the main case: creating and using a local Solr
index of a custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your
domain.
-The ability to work with custom vocabularies in Stanbol is necessary for many
organizational use cases such as beeing able to detect various types of named
entities specific to a company or to detect and work with concepts from a
specific domain. Stanbol provides the machinery to start with vocabularies in
standard languages such as [SKOS - Simple Knowledge Organization
Systems](http://www.w3.org/2004/02/skos/) or more general
[RDF](http://www.w3.org/TR/rdf-primer/) encoded data sets. The respective
Stanbol components, which are needed for this functionality are the Entityhub
for creating and managing the index and several [Enhancement
Engines](engines.html) to make use of the index during the enhancement process.
+## Creating and working with custom local indexes
-### Create your own index
+Stanbol provides the machinery to start with vocabularies in standard
languages such as [SKOS - Simple Knowledge Organization
System](http://www.w3.org/2004/02/skos/) or, more generally,
[RDF](http://www.w3.org/TR/rdf-primer/) encoded data sets. The Stanbol
components needed for this functionality are the Entityhub, which creates and
manages the indexes, and several [Enhancement
Engines](engines.html) that make use of the indexes during the enhancement
process.
+
+### A. Create your own index
**Step 1: Create the indexing tool**
The indexing tool provides a default configuration for creating a Solr index
of RDF files (e.g. a SKOS export of a thesaurus or a set of FOAF files).
-(1) If not yet built during the Stanbol build process of the entityhub call
+If the indexing tool was not yet built during the Stanbol build process of the Entityhub, call
mvn install
@@ -40,7 +43,14 @@ Initialize the tool with
java -jar
org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar
init
-You will get a directory with the default configuration files, one for the
sources and a distribution directory for the resulting files. Make sure, that
you adapt the default configuration with at least the name of your index and
namespaces and properties you need to include to the index and copy your source
files into the respective directory <code>indexing/resources/rdfdata</code>.
Several standard formats for RDF, multiple files and archives of them are
supported. *For details of possible configurations, please consult the
<code>{root}/entityhub/indexing/genericrdf/readme.md</code>.*
+You will get a directory with the default configuration files, one for the
sources and a distribution directory for the resulting files. Make sure that
you adapt the default configuration with at least
+
+- the id/name and license information of your data and
+- the namespace and property mappings you want to include in the index (see
the example of a [mappings.txt](examples/anl-mappings.txt) including default and
specific mappings for one dataset)
+
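For illustration, a minimal mappings configuration for a simple SKOS thesaurus might look like the following sketch (distilled from the default mappings of the indexing tool; the language restriction is an assumption and can be dropped to import all languages):

```
# import only labels without a language tag plus English and German ones
| @=null;en;de

# index the types of the entities as resolvable references
rdf:type | d=entityhub:ref

# index all SKOS properties and copy the preferred label over to rdfs:label
rdfs:label
skos:*
skos:prefLabel > rdfs:label
```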
+Then, copy your source files into the respective directory
<code>indexing/resources/rdfdata</code>. Several standard RDF formats,
multiple files, and archives of them are supported.
+
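Putting the steps above together, the working directory created by the init call might roughly look like this sketch (the <code>config</code> and <code>dist</code> names are assumptions based on the tool's defaults; only <code>indexing/resources/rdfdata</code> is stated above):

```
indexing/
    config/             # default configuration files, e.g. mappings.txt
    resources/
        rdfdata/        # copy your RDF source files here
    dist/               # the resulting index archive and installer bundle
```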
+*For more details on possible configurations, please consult the README at
<code>{root}/entityhub/indexing/genericrdf/</code>.*
Then, you can start the indexing by running
@@ -54,7 +64,7 @@ Depending on your hardware and on comple
At your running Stanbol instance, copy the ZIP archive into
<code>{root}/sling/datafiles</code>. Then, at the "Bundles" tab of the
administration console add and start the
<code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.
-### Configuring the enhancement engines
+### B. Configure and use the index with enhancement engines
Before you can make use of the custom vocabulary you need to decide which
kind of enhancements you want to support. If your enhancements are
NamedEntities in the stricter sense (Persons, Locations, Organizations),
then you may use the standard NER engine together with its
EntityLinkingEngine to configure the destination of your links.
@@ -69,15 +79,15 @@ In the following the configuration optio
(2) Open the configuration console at
http://localhost:8080/system/console/configMgr and navigate to the
TaxonomyLinkingEngine. Its main options are configurable via the UI.
-- Referenced Site: {put the id/name of your index} (required)
-- Label Field: {the property to search for}
+- Referenced Site: {put the id/name of your index}
+- Label Field: {the property to search for}
- Use Simple Tokenizer: {deactivate to use language specific tokenizers}
- Min Token Length: {set minimal token length}
- Use Chunker: {disable/enable language specific chunkers}
- Suggestions: {maximum number of suggestions}
- Number of Required Tokens: {minimal required tokens}
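As an illustration, a filled-in configuration for a hypothetical SKOS index could look like this (all values are assumptions and need to be adapted to your data):

```
Referenced Site: myThesaurus
Label Field: skos:prefLabel
Use Simple Tokenizer: false
Min Token Length: 3
Use Chunker: false
Suggestions: 3
Number of Required Tokens: 1
```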
-*For further details please on the engine and its configuration please consult
the according Readme file at TODO: create the readme
<code>{root}/stanbol/enhancer/engines/taxonomylinking/<code>.*
+*For further details on the engine and its configuration please refer
to the corresponding README at
<code>{root}/stanbol/enhancer/engines/taxonomylinking/</code>.* (TODO: create
the README)
**Use several instances of the TaxonomyLinkingEngine**
@@ -87,28 +97,18 @@ To work at the same time with different
**Use the TaxonomyLinkingEngine together with the NER engine and the
EntityLinkingEngine**
-If your text corpus contains and you are interested in both, generic
NamedEntities and custom thesaurus you may use
-
-
-
-### Demos and Examples
-
-- The full demo installation of Stanbol is configured to also work with an
environmental thesaurus - if you test it with unstructured text from the
domain, you should get enhancements with additional results for specific
"concepts".
-- One example can be found with metadata from the Austrian National Library is
described (TODO: link) here.
-
-(TODO) - Examples
-
+If your text corpus contains both generic NamedEntities and concepts from a
custom thesaurus, and you are interested in both, you may use (TODO)
-## Create a custom index for dbpedia
-(TODO) dbpedia indexing (<-- olivier)
+## Specific Examples
+**Create your custom index for DBpedia:** (TODO: DBpedia indexing (<--
olivier))
-## Working with ontologies in EntityHub
-(TODO)
+## Resources
-### Demos and Examples
+- The full [demo](http://dev.iks-project.eu:8081/) installation of Stanbol is
configured to also work with an environmental thesaurus - if you test it with
unstructured text from the domain, you should get enhancements with additional
results for specific "concepts".
+- Download custom test indexes and installer bundles for Stanbol from
[here](http://dev.iks-project.eu/downloads/stanbol-indices/) (e.g. for the
GEMET environmental thesaurus or a big DBpedia index).
+- Another concrete example with metadata from the Austrian National Library is
described here (TODO: link).
-(TODO)
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt?rev=1171482&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
Fri Sep 16 09:59:13 2011
@@ -0,0 +1,164 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+#NOTE: THIS IS A DEFAULT MAPPING SPECIFICATION THAT INCLUDES MAPPINGS FOR
+# COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION BY
+# COMMENTING/UNCOMMENTING AND/OR ADDING NEW MAPPINGS
+
+# --- Define the Languages for all fields ---
+# to restrict languages to be imported (for all fields)
+#| @=null;en;de;fr;it
+
+#NOTE: null is used to import labels with no specified language
+
+# to import all languages leave this empty
+
+# --- RDF RDFS and OWL Mappings ---
+# This configuration only indexes properties that are typically used to store
+# instance data defined by such namespaces. This excludes ontology definitions.
+
+# NOTE that nearly all other ontologies are using properties of these three
+# schemas, therefore it is strongly recommended to include such information!
+
+rdf:type | d=entityhub:ref
+
+rdfs:label
+rdfs:comment
+rdfs:seeAlso | d=entityhub:ref
+
+
+owl:sameAs | d=entityhub:ref
+
+#If one wants to also index ontologies one should add the following statements
+#owl:*
+#rdfs:*
+
+# --- Dublin Core (DC) ---
+# The default configuration imports all dc-terms data and copies values of the
+# old dc-elements standard over to the corresponding properties of the dc-terms
+# standard.
+
+# NOTE that a lot of other ontologies also use DC for some of their data,
+# therefore it is strongly recommended to include such information!
+
+#mapping for all dc-terms properties
+dc:*
+
+# copy dc:title to rdfs:label
+dc:title > rdfs:label
+
+# deactivated by default, because the dc-elements properties are mapped to dc-terms
+#dc-elements:*
+
+# mappings for the dc-elements properties to the dc-terms
+dc-elements:contributor > dc:contributor
+dc-elements:coverage > dc:coverage
+dc-elements:creator > dc:creator
+dc-elements:date > dc:date
+dc-elements:description > dc:description
+dc-elements:format > dc:format
+dc-elements:identifier > dc:identifier
+dc-elements:language > dc:language
+dc-elements:publisher > dc:publisher
+dc-elements:relation > dc:relation
+dc-elements:rights > dc:rights
+dc-elements:source > dc:source
+dc-elements:subject > dc:subject
+dc-elements:title > dc:title
+dc-elements:type > dc:type
+#also use dc-elements:title as label
+dc-elements:title > rdfs:label
+
+# --- Social Networks (via foaf) ---
+#The Friend of a Friend schema is often used to describe social relations
+#between people
+foaf:*
+
+# copy the name of a person over to rdfs:label
+foaf:name > rdfs:label
+
+# additional data type checks
+foaf:knows | d=entityhub:ref
+foaf:made | d=entityhub:ref
+foaf:maker | d=entityhub:ref
+foaf:member | d=entityhub:ref
+foaf:homepage | d=xsd:anyURI
+foaf:depiction | d=xsd:anyURI
+foaf:img | d=xsd:anyURI
+foaf:logo | d=xsd:anyURI
+#page about the entity
+foaf:page | d=xsd:anyURI
+
+
+# --- Simple Knowledge Organization System (SKOS) ---
+
+# A common data model for sharing and linking knowledge organization systems
+# via the Semantic Web. Typically used to encode controlled vocabularies such as
+# a thesaurus
+skos:*
+
+# copy the preferred label over to rdfs:label
+skos:prefLabel > rdfs:label
+
+# copy values of the **Match relations to the corresponding related, broader
+# and narrower properties
+skos:relatedMatch > skos:related
+skos:broadMatch > skos:broader
+skos:narrowMatch > skos:narrower
+
+#similar mappings for transitive variants are not contained, because transitive
+#reasoning is not directly supported by the Entityhub.
+
+# Some SKOS thesauri do use "skos:broaderTransitive" and "skos:narrowerTransitive";
+# however such properties are only intended to be used by reasoners to
+# calculate transitive closures over broader/narrower hierarchies.
+# see http://www.w3.org/TR/skos-reference/#L2413 for details
+# to correct such cases we will copy transitive relations to their counterparts
+skos:narrowerTransitive > skos:narrower
+skos:broaderTransitive > skos:broader
+
+
+# --- Semantically-Interlinked Online Communities (SIOC) ---
+
+# an ontology for describing the information in online communities.
+# This information can be used to export information from online communities
+# and to link them together. The scope of the application areas that SIOC can
+# be used for includes (and is not limited to) weblogs, message boards,
+# mailing lists and chat channels.
+sioc:*
+
+# --- biographical information (bio) ---
+# A vocabulary for describing biographical information about people, both
living
+# and dead. (see http://vocab.org/bio/0.1/)
+bio:*
+
+# --- Rich Site Summary (rss) ---
+rss:*
+
+# --- GoodRelations (gr) ---
+# GoodRelations is a standardised vocabulary for product, price, and company
data
+gr:*
+
+# --- Creative Commons Rights Expression Language (cc) ---
+# The Creative Commons Rights Expression Language (CC REL) lets you describe
+# copyright licenses in RDF.
+cc:*
+
+# --- Additional namespaces added for the Europeana dataset
(http://ckan.net/dataset/europeana-lod) ---
+http://www.europeana.eu/schemas/edm/*
+http://www.openarchives.org/ore/terms/*
+
+
+
+
+