Author: agruber
Date: Thu Oct 6 08:03:12 2011
New Revision: 1179530
URL: http://svn.apache.org/viewvc?rev=1179530&view=rev
Log:
Add documentation on several engines: langID, Metaxa, NER, NET and multilingual
case description
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext
Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext?rev=1179530&r1=1179529&r2=1179530&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext
(original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext Thu
Oct 6 08:03:12 2011
@@ -2,17 +2,18 @@ Title: Enhancement Engines and their mai
## Preprocessing
-- __Language Identification Engine__
- - langage dedection for textual content using Apache Tika
+- __[Language Identification Engine](enhancer/engines/langidengine.html)__
+ - language detection for textual content utilizing [Apache
Tika](http://tika.apache.org/)
-- __Metaxa Engine__
- - text extraction from various documents
- - extraction of metadata from documents
+- __[Metaxa Engine](enhancer/engines/metaxaengine.html)__
+ - text extraction from various document formats
+ - extraction of metadata from document formats
+ -
## Natural Language Processing
-- __Named Entity Extraction Enhancement Engine__
+- __[Named Entity Extraction Enhancement
Engine](enhancer/engines/namedentityextractionengine.html)__
- NLP processing using OpenNLP NER
- detect occurrences of persons, places and organizations only
@@ -21,6 +22,7 @@ Title: Enhancement Engines and their mai
- NLP processing using OpenNLP
- supports multiple languages
- detect occurrences of untyped entities as concepts, takes local
taxonomies as linking target
+
- _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
- NLP processing using OpenNLP POS
@@ -29,7 +31,7 @@ Title: Enhancement Engines and their mai
## Linking Suggestions
-- __Named Entity Tagging Engine__
+- __[Named Entity Tagging
Engine](enhancer/engines/namedentitytaggingengine.html)__
- suggest links to several Linked Data Sources (e.g. dbpedia)
- __Location Enhancement Engine__
@@ -45,7 +47,7 @@ Title: Enhancement Engines and their mai
## Postprocessing / Other
-- __CachingDereferencerEngine__
+- _CachingDereferencerEngine_ (deprecated, see dereferencing support of
individual engines as well as
[STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
- retrieves additional content for presenting the enhancement results.
- __Refactor Engine__
Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext?rev=1179530&r1=1179529&r2=1179530&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext
(original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext Thu
Oct 6 08:03:12 2011
@@ -49,13 +49,17 @@ The web interface of your Apache Stanbol
## Usage Scenarios for Apache Stanbol
-* [Content Enhancement](contentenhancement.html)
+* [Basic Content Enhancement](contentenhancement.html)
Analyze textual content, enhance it with named entities (person, place,
organization), suggest links to open data sources.
* [Working with "local" Entities](customvocabulary.html)
- Use locally defined entities (e.g. thesaurus concepts) from an organization's
context.
+ Use locally defined entities (e.g. thesaurus concepts) from an organization's
context.
+
+* [Working with multiple languages](multilingual.html)
+
+ Get enhancements for textual content in multiple languages (EN, DE, SV, DA,
PT and NL).
* Semantic Search in Portals
@@ -66,7 +70,7 @@ The web interface of your Apache Stanbol
Refactor the enhancement result, its property names and ontology types
according your target ontology.
* Transforming CMS repository structures into ontologies
- Provide repository structures as thesaurus or domain ontology, e.g.
categories.
+* Provide repository structures as thesaurus or domain ontology, e.g.
categories.
## Technical Documentation
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
Thu Oct 6 08:03:12 2011
@@ -0,0 +1,80 @@
+Title: The Language Identification Engine: detect the language of a text
+
+The **LangId** engine determines the language of text.
+
+## Technical Description
+
+The provided engine is based on the language identifier of [Apache
Tika](http://tika.apache.org/).
+The text to be checked must be provided in plain text format in one of two
forms:
+
+* a plain text content item
+* in the content item's metadata, as the string value of the property
+
+
<pre><code>http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent</code></pre>
+
+The result of language identification is added as a TextAnnotation to the
content item's metadata, as the string value of the property
+
+ http://purl.org/dc/terms/language
+
+This RDF snippet illustrates the output:
+
+ <fise:TextAnnotation
rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
+ <dc:language>en</dc:language>
+
<dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
+ </fise:TextAnnotation>
+
+
+By default, the language identifier distinguishes the languages listed below.
Each entry gives the language name, followed after the colon by the language label used in the metadata.
+
+* German: de
+* English: en
+* Estonian: et
+* French: fr
+* Spanish: es
+* Italian: it
+* Swedish: sv
+* Polish: pl
+* Dutch: nl
+* Norwegian: no
+* Finnish: fi
+* Greek: el
+* Danish: da
+* Hungarian: hu
+* Icelandic: is
+* Lithuanian: lt
+* Portuguese: pt
+* Russian: ru
+* Thai: th
+
+Additional language models can be created as Tika LanguageProfile instances
(<code>org.apache.tika.language.LanguageProfile</code>).
+
+## Configuration options
+
+* <code>org.apache.stanbol.enhancer.engines.langid.probe-length</code>
+
+ an integer specifying how many characters will be used for
+ identification. A value of 0 or below means to use the complete
+ text. Otherwise only a substring of the specified length taken from the
+ middle of the text will be used. The default value is 400 characters.
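
The probe-length rule above can be sketched as follows. This is a minimal illustration in Python, not the engine's implementation: the function name `select_probe`, the exact centring arithmetic, and the handling of texts shorter than the probe length are assumptions.

```python
def select_probe(text: str, probe_length: int = 400) -> str:
    """Return the part of `text` that would be used for language identification.

    Hypothetical helper: only the rule "0 or below means the complete text,
    otherwise a substring from the middle" comes from the documentation.
    """
    # A value of 0 or below means the complete text is used.
    # Assumption: a text shorter than the probe length is also used as a whole.
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Otherwise take a window of `probe_length` characters centred on the middle.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]
```

With the default of 400 characters, a 1000-character document would contribute only the characters at offsets 300 to 699 under these assumptions.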
+
+## Usage
+
+Assuming that the Stanbol endpoint with the full launcher is running at
+
+ http://localhost:8080
+
+and the engine is activated, commands like the following can be used from
+the command line to submit a text file as a content item:
+
+* stateless interface
+
+ curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt
http://localhost:8080/engines
+
+* stateful interface
+
+ curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt
http://localhost:8080/contenthub/content/someFileId
+
+ Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at
+
+ http://localhost:8080/contenthub
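
For callers that prefer a script over curl, the stateless request above can be sketched with Python's standard library. This is a hedged sketch: the function names `build_enhance_request` and `submit` are hypothetical, and the endpoint and content type are simply copied from the curl example.

```python
import urllib.request


def build_enhance_request(text: str,
                          base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request equivalent to the stateless curl call."""
    return urllib.request.Request(
        url=base_url + "/engines",              # stateless endpoint from the curl example
        data=text.encode("utf-8"),              # request body: the plain text to analyse
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the response body (needs a running Stanbol)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `submit(build_enhance_request("John Smith lives in London"))` would only work while a Stanbol instance is listening at the assumed address.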
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
Thu Oct 6 08:03:12 2011
@@ -0,0 +1,281 @@
+Title: The Metaxa Enhancement Engine: extracting content and metadata from
various formats
+
+The **Metaxa Enhancement Engine** extracts embedded metadata and textual
content from a large variety of document types and formats. The text extraction
functionality also makes Metaxa suitable as a pre-processor for other
components, especially NLP processors and indexing for search.
+
+## Technical description
+
+The engine is based on the [Aperture framework](http://aperture.sourceforge.net/), with new extensions for handling structured content embedded in HTML web content, such as [Microformats](http://microformats.org/) and [RDFa](http://www.w3.org/TR/rdfa-syntax/).
+In addition, some of Aperture's original extractors were replaced by other engines that use different base libraries.
+Metaxa introduces a single TextEnhancement instance that refers to the content item by its *extracted-from* property. The specific metadata extracted by Metaxa are ascribed directly to the content item/document, since they represent document properties and not text annotations. Several ontologies are used to describe the different types of metadata; an overview is given below.
+
+The general structure of the Metaxa annotations consists of three levels of
annotations illustrated in the following example:
+
+#### The top-level <tt>TextAnnotation</tt> instance
+
+ <urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b>
+ a <http://fise.iks-project.eu/ontology/TextAnnotation> ,
+ <http://fise.iks-project.eu/ontology/Enhancement> ;
+ <http://fise.iks-project.eu/ontology/confidence>
+ "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
+ <http://fise.iks-project.eu/ontology/extracted-from>
+ <http://localhost:8080/store/content/mf_example.htm> ;
+ <http://purl.org/dc/terms/created>
+
"2010-09-22T09:06:53.056+02:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
+ <http://purl.org/dc/terms/creator>
+
"org.apache.enhancer.engines.metaxa.MetaxaEngine"^^<http://www.w3.org/2001/XMLSchema#string>
.
+
+
+#### The top-level document metadata, referenced from the
<tt>TextAnnotation</tt> instance via the *extracted-from* property:
+
+ <http://localhost:8080/store/content/mf_example.htm>
+ a
<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument> ;
+ <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains>
+ <urn:rnd:-9e25553:12b3843df43:-7ffe> ;
+ <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description>
+ "Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las
Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World.
Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL
Bonded." ;
+ <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword>
+ "travel" , "bargain flights" , "late deals" , "hotels" , "air
tickets" , "air fares" , "discount travel" , "last minute flights" , "cheap
airlines" , "cheap holidays" , "cheap flights" , "flightline" , "hotel
reservations" , "discount flights" , "air travel" , "package holidays" ;
+
<http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent>
+ "More Than Just Cheap Flights ..." ;
+ <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>
+ "Flightline | Cheap Flights, Package Holidays, Hotels, Travel
Insurance & More" .
+
+#### Embedded <tt>hCard</tt> microformat data referenced via the
<tt>nie:contains</tt> property:
+
+
+ <urn:rnd:-9e25553:12b3843df43:-7ffe>
+ a <http://www.w3.org/2006/vcard/ns#VCard> ;
+ <http://www.w3.org/2006/vcard/ns#adr>
+ <urn:rnd:-9e25553:12b3843df43:-7ffc> ;
+ <http://www.w3.org/2006/vcard/ns#fn>
+ "Flightgeoline Essex Limited" ;
+ <http://www.w3.org/2006/vcard/ns#geo>
+ <urn:rnd:-9e25553:12b3843df43:-7ffb> ;
+ <http://www.w3.org/2006/vcard/ns#org>
+ <urn:rnd:-9e25553:12b3843df43:-7ffd> ;
+ <http://www.w3.org/2006/vcard/ns#photo>
+
<https://www.flightline.co.uk/common/images/building_banner_sm.jpg> ;
+ <http://www.w3.org/2006/vcard/ns#url>
+ <http://www.flightline.co.uk> ;
+ <http://www.w3.org/2006/vcard/ns#workTel>
+ <tel:0800541541> .
+
+ <urn:rnd:-9e25553:12b3843df43:-7ffd>
+ a <http://www.w3.org/2006/vcard/ns#Organization> ;
+ <http://www.w3.org/2006/vcard/ns#organization-name>
+ "Flightline Essex Limited" .
+
+ <urn:rnd:-9e25553:12b3843df43:-7ffc>
+ a <http://www.w3.org/2006/vcard/ns#Address> ;
+ <http://www.w3.org/2006/vcard/ns#countryName>
+ "UK" ;
+ <http://www.w3.org/2006/vcard/ns#extendedAddress>
+ "Flightline House" ;
+ <http://www.w3.org/2006/vcard/ns#locality>
+ "Westcliff-on-Sea" ;
+ <http://www.w3.org/2006/vcard/ns#postalCode>
+ "SS0 7JE" ;
+ <http://www.w3.org/2006/vcard/ns#region>
+ "Essex" ;
+ <http://www.w3.org/2006/vcard/ns#streetAddress>
+ "32-38 Milton Road" .
+
+ <urn:rnd:-9e25553:12b3843df43:-7ffb>
+ a <http://www.w3.org/2006/vcard/ns#Location> ;
+ <http://www.w3.org/2006/vcard/ns#latitude>
+ "51.53894902845868" ;
+ <http://www.w3.org/2006/vcard/ns#longitude>
+ "0.700753927230835" .
+
+
+
+### Supported document types
+
+The set of extraction engines for specific document types is defined by the
resource *extractionregistry.xml*. Each engine specifies what MIME types it can
handle. By default the extraction registry provides extractors for the
+following set of document formats:
+
+* *Office documents*:
+ * MS-Works
+ * MS-Office
+ * Excel
+ * PowerPoint
+ * Word
+ * Visio
+ * OpenDocument
+ * OpenXml
+ * Publisher
+ * Corel-Presentations
+ * QuattroPro
+ * WordPerfect
+
+* *Multimedia documents*:
+ * JPG
+ * MP3
+
+* *(X)HTML*, also supporting the following types of embedded structures/microformats,
as defined by the default resource *htmlextractors.xml*:
+ * RDFa
+ * geo
+ * hAtom
+ * hCal
+ * hCard
+ * hReview
+ * rel-license
+ * rel-tag
+ * xFolk
+
+* *Other*:
+ * PDF
+ * RTF
+ * Plain Text
+ * XML
+
+### Textual Content
+
+Metaxa represents the plain text content of a document in the content item's
metadata as the value of the property:
+
+ http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
+
+### Vocabularies
+
+Metaxa uses a set of vocabularies ("ontologies") for structured data
representation.
+
+#### Aperture Core Ontologies
+
+These ontologies belong to the underlying Aperture subsystem, contained in the
+package
+
+ org.semanticdesktop.aperture.vocabulary
+
+The most important ones with respect to top-level document properties are
+
+* NIE (Nepomuk Information Element):
+
+ http://www.semanticdesktop.org/ontologies/2007/01/19/nie#
+
+* NFO (Nepomuk File Ontology):
+
+ http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#
+
+Documentation of Aperture's core ontologies is provided in Aperture's Javadoc
[http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html](http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html)
for the packages in
+
+ org.semanticdesktop.aperture.vocabulary.
+
+#### HTML Microformat Extractors
+
+The following table describes which vocabularies are used for representing
microformat data in Metaxa:
+
+
+<table border="1">
+ <tr>
+ <th>MF</th>
+ <th>Vocabulary (Namespace)</th>
+ </tr>
+ <tr>
+ <td>geo</td>
+ <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+ </tr>
+ <tr>
+ <td>hAtom</td>
+ <td>atom (<tt>http://www.w3.org/2005/Atom#</tt>)</td>
+ </tr>
+ <tr>
+ <td/>
+ <td>tagging
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+ </tr>
+ <tr>
+ <td>hCal</td>
+ <td> ical (<tt>http://www.w3.org/2002/12/cal/icaltzd#</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+ </tr>
+ <tr>
+ <td>hCard</td>
+ <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+ </tr>
+ <tr>
+ <td>hReview</td>
+ <td>review (<tt>http://www.purl.org/stuff/rev#</tt>)</td></tr>
+ <tr>
+ <td></td>
+ <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>dcterms (<tt>http://purl.org/dc/dcmitype/</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>foaf (<tt>http://xmlns.com/foaf/0.1/</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>tag (<tt>http://www.holygoat.co.uk/owl/redwood/0.1/tags/</tt>)</td>
+ </tr>
+ <tr>
+ <td>rel-license</td>
+ <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+ </tr>
+ <tr>
+ <td>rel-tag</td>
+ <td> tagging
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+ </tr>
+ <tr>
+ <td>xFolk</td>
+ <td>nfo
(<tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>tagging
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+ </tr>
+</table>
+
+## Configuration options
+
+By default, Metaxa uses the extractors specified in the resource
"extractionregistry.xml", and for HTML pages, the resource "htmlextractors.xml".
+Alternative configurations and extractors can be attached to Metaxa as
fragment bundles, specifying as host bundle
+
+ Fragment-Host: org.apache.stanbol.enhancer.engines.metaxa
+
+The alternative configuration files then can be set as values of the properties
+
+*
<pre><code>org.apache.stanbol.enhancer.engines.metaxa.extractionregistry</code></pre>
+
+*
<pre><code>org.apache.stanbol.enhancer.engines.metaxa.htmlextractors</code></pre>
+
+## Usage
+
+Assuming that the Stanbol endpoint with the full launcher is running at
+
+ http://localhost:8080
+
+and the engine is activated, commands like the following can be used from the
command line to submit a file as a content item. The MIME type given in the
Content-Type header must match the document type:
+
+* stateless interface
+
+ curl -i -X POST -H "Content-Type:text/html" -T testpage.html
http://localhost:8080/engines
+
+* stateful interface
+
+ curl -i -X PUT -H "Content-Type:text/html" -T testpage.html
http://localhost:8080/contenthub/content/someFileId
+
+ Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at
+
+ http://localhost:8080/contenthub
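
Because the Content-Type header has to match the document type, a script that submits mixed files needs to choose the header per file. One possible approach, sketched here with Python's standard `mimetypes` module, is to guess it from the file name; the helper name `content_type_for` and the fallback value are assumptions, and guessing by extension is not something Metaxa itself prescribes.

```python
import mimetypes


def content_type_for(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a Content-Type header value from the file extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic type when the extension is unknown (assumption).
    return guessed or default
```

For the file used in the curl examples, `content_type_for("testpage.html")` yields `text/html`.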
+
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
Thu Oct 6 08:03:12 2011
@@ -0,0 +1,64 @@
+Title: Configure Apache Stanbol to work with multiple languages
+
+
+The following languages are supported:
+
+- English
+- German
+- Danish
+- Swedish
+- Dutch
+- Portuguese
+
+
+## Configuration steps
+
+- Have language labels in your target data and install the index
+- Activate the LangIdEnhancementEngine and the KeywordLinkingEngine
+- Add language models to your Stanbol instance
+- Configure the KeywordLinkingEngine
+
+
+### Install your index
+
+If you want to use an index of your custom vocabulary, first [create an
index](customvocabulary.html) from it and then add the index to your Stanbol
instance. Copy the <code>{yourindex}.solr.zip</code> file into your
<code>{stanbol-root}/sling/datafiles</code> directory and install the
respective OSGi bundle in your OSGi admin console.
+
+Make sure that this index contains labels in all the languages you want
to work with and that they are properly indexed.
+
+### Build and add the necessary language bundles
+
+To build the language bundles, go to "{stanbol-root}/data/" and run
+
+ mvn clean install -P opennlp
+
+This enables the profile to build the OpenNLP models for all languages.
+
+After this the bundles are available in the folder
+
+ {stanbol-root}/data/opennlp/lang/{language}/target
+
+The bundles are named
"org.apache.stanbol.data.opennlp.lang.{language}-*.jar".
+
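
To check which language bundles the build produced, the folder layout and naming pattern described above can be turned into a small lookup. This is a sketch under the stated layout only; the function name `find_language_bundles` is hypothetical and the language codes are passed in by the caller.

```python
from pathlib import Path


def find_language_bundles(stanbol_root: str, languages: list[str]) -> dict[str, list[Path]]:
    """Map each language code to the bundle jars found in its target folder."""
    found: dict[str, list[Path]] = {}
    for language in languages:
        # {stanbol-root}/data/opennlp/lang/{language}/target
        target = Path(stanbol_root) / "data" / "opennlp" / "lang" / language / "target"
        # org.apache.stanbol.data.opennlp.lang.{language}-*.jar
        pattern = f"org.apache.stanbol.data.opennlp.lang.{language}-*.jar"
        found[language] = sorted(target.glob(pattern)) if target.is_dir() else []
    return found
```

An empty list for a language indicates that its bundle has not been built yet, or that the build output is somewhere else.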
+Add the bundle via the OSGi admin console in the bundles tab. The language
bundles will fetch and install the corresponding
[OpenNLP](http://dev.iks-project.eu/downloads/opennlp/models-1.5/) models for
the languages you want to use.
+
+OpenNLP provides language support
+
+
+
+### Activate the LangID engine and the KeywordLinkingEngine
+
+Go to the admin console and deactivate some of the available engines.
In particular, the standard NER engine and the Entity Linking Engines should be
deactivated, as they do not support multiple languages. At least two engines
need to be activated:
+
+- The [Language Identification Engine](enhancer/engines/langidengine.html)
provides the language of the text you want to enhance; it creates a
dc:language property (http://purl.org/dc/terms/language).
+- The [Keyword Linking Engine](enhancer/engines/keywordlinkingengine.html)
+
+
+
+### Configure the KeywordLinkingEngine
+
+(TODO)
+
+
+## Examples
+
+(TODO)
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext?rev=1179530&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
Thu Oct 6 08:03:12 2011
@@ -0,0 +1,29 @@
+Title: The Named Entity Recognition Engine: detect Named Entities from
unstructured text content
+
+This engine is based on the NLP features of [Apache OpenNLP
(incubating)](http://incubator.apache.org/opennlp/). It uses OpenNLP's Maximum
Entropy models to detect Persons, Places and Organizations.
+
+(TODO: features, configuration if possible)
+
+
+## Example Result
+
+This engine adds **TextAnnotation** enhancements to the enhancement graph.
For the text "John Smith lives in London" it adds, among others, the following
annotation, which suggests the type Place for the string "London":
+
+ {
+ "@subject": "<urn:enhancement-e6a08398-a49f-5bf6-c09f-6da5db63507e>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:TextAnnotation>"
+ ],
+ "dc:created": "2011-10-04T12:36:50.670Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore",
+ "dc:type": "<dbp-ont:Place>",
+ "enhancer:confidence": 0.99691045,
+ "enhancer:end": 26,
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>",
+ "enhancer:selected-text": "London",
+ "enhancer:selection-context": "John Smith lives in London",
+ "enhancer:start": 20
+ }
+
+This enhancement statement provides you with the ID and date of the
enhancement, the suggested type with a confidence for it, the position of the
selected text and its (sentence) context as well as the link to the source
document.
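
The relationship between the offsets and the selected text can be checked with a few lines of Python. This is a sketch based only on the example above; it assumes that `enhancer:start` is inclusive, that `enhancer:end` is exclusive, and that both are character offsets into the submitted text, which is consistent with the example but not stated by the engine documentation. The helper name `selected_span` is hypothetical.

```python
# Values copied from the example TextAnnotation above.
annotation = {
    "enhancer:selected-text": "London",
    "enhancer:selection-context": "John Smith lives in London",
    "enhancer:start": 20,
    "enhancer:end": 26,
}


def selected_span(content: str, annotation: dict) -> str:
    """Cut the annotated span out of the submitted text (start inclusive, end exclusive)."""
    return content[annotation["enhancer:start"]:annotation["enhancer:end"]]


content = "John Smith lives in London"
# The slice reproduces the text that the annotation reports as selected.
assert selected_span(content, annotation) == annotation["enhancer:selected-text"]
```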
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext?rev=1179530&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
Thu Oct 6 08:03:12 2011
@@ -0,0 +1,165 @@
+Title: The Named Entity Tagging Engine: linking text annotations to (external)
datasets of entities
+
+The Named Entity Tagging Engine uses *[Referenced Sites](../../entityhub.html)* to
search for Entities based on given Text Annotations.
+
+## Configuration
+
+The configuration determines which dataset is used as the linking target.
The default value is "local", which refers to the default DBpedia index. You can
also decide whether the given types should restrict the set of possible links.
For example, some organisations in DBpedia are not tagged as such, so you
will not get them from this engine when the type restriction is set, even though you expect them from your dataset.
+
+- Referenced Site: {local, your referenced site}
+
+ *The ID of the Entityhub Referenced Site used for semantic lifting of
TextAnnotations.*
+
+- Persons: {true, false}
+
+ *Set to TRUE to enable semantic lifting of Persons*
+
+- Person Type: {<empty>, dbp-ont:Person}
+
+ *The rdf:type used to search for Persons. If empty, Entities of any type are
accepted.*
+
+- Organisations: {true, false}
+
+ *Set to TRUE to enable semantic lifting of Organisations*
+
+- Organisation Type: {<empty>, dbp-ont:Organisation}
+
+ *The rdf:type used to search for Organisations. If empty, Entities of any
type are accepted.*
+
+- Places: {true, false}
+
+ *Set to TRUE to enable semantic lifting of Places*
+
+- Place Type: {<empty>, dbp-ont:Place}
+
+ *The rdf:type used to search for Places. If empty, Entities of any type are
accepted.*
+
+- Label Field: {<empty>, rdfs:label}
+
+ *The field used to search for Entities with a label similar to the selected
text of the Text Annotation. If empty, rdfs:label is used as the default.*
+
+
+## Example Result
+
+For the sentence "John Smith lives in London", you will get several
EntityAnnotations for the terms "London" and "John Smith" from your linking target
resource (in this case DBpedia), each with a confidence value that can be
used to sort the suggestions.
+
+ {
+ "@subject": "<urn:enhancement-2ec0662c-3a10-f8f5-43b4-cf7403e4c39d>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:04.175Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+ "enhancer:confidence": 5147829.5,
+ "enhancer:entity-label": "\"London\"@en",
+ "enhancer:entity-reference": "<http://dbpedia.org/resource/London>",
+ "enhancer:entity-type": "<owl:Thing>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-44ccea73-639d-394a-8660-fad46795a772>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:06.809Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+ "enhancer:confidence": 4.471743,
+ "enhancer:entity-label": "\"John L. Smith\"@en",
+ "enhancer:entity-reference":
"<http://dbpedia.org/resource/John_L._Smith>",
+ "enhancer:entity-type": "<dbp-ont:CollegeCoach>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:TextAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:44:52.318Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore",
+ "dc:type": "<dbp-ont:Person>",
+ "enhancer:confidence": 0.66891855,
+ "enhancer:end": 10,
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>",
+ "enhancer:selected-text": "John Smith",
+ "enhancer:selection-context": "John Smith lives in London",
+ "enhancer:start": 0
+ },
+ {
+ "@subject": "<urn:enhancement-708bfdae-c104-19bd-c423-f5c10a11ae55>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:04.216Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+ "enhancer:confidence": 2543.5994,
+ "enhancer:entity-label": "\"London, Ontario\"@en",
+ "enhancer:entity-reference":
"<http://dbpedia.org/resource/London,_Ontario>",
+ "enhancer:entity-type": "<http://www.opengis.net/gml/_Feature>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-73dce2ac-72b6-b0f4-7c5c-e9c30aec9263>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:04.216Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+ "enhancer:confidence": 7709.837,
+ "enhancer:entity-label": "\"City of London\"@en",
+ "enhancer:entity-reference":
"<http://dbpedia.org/resource/City_of_London>",
+ "enhancer:entity-type": "<http://www.opengis.net/gml/_Feature>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-c428cb67-cdce-4396-96b8-ac3a8465730a>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:TextAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:44:39.064Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
+ "dc:language": "\"fi\"",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-c6ffb5f4-a224-9b7d-9854-7eaa101b2ebe>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:06.809Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+ "enhancer:confidence": 15.735652,
+ "enhancer:entity-label": "\"John Maynard Smith\"@en",
+ "enhancer:entity-reference":
"<http://dbpedia.org/resource/John_Maynard_Smith>",
+ "enhancer:entity-type": "<dbp-ont:Scientist>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ },
+ {
+ "@subject": "<urn:enhancement-eeaf0331-5988-5231-493c-f934a2602200>",
+ "@type": [
+ "<enhancer:Enhancement>",
+ "<enhancer:EntityAnnotation>"
+ ],
+ "dc:created": "2011-10-06T07:45:06.809Z",
+ "dc:creator":
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+ "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+ "enhancer:confidence": 4.4515367,
+ "enhancer:entity-label": "\"John T. Smith\"@en",
+ "enhancer:entity-reference":
"<http://dbpedia.org/resource/John_T._Smith>",
+ "enhancer:entity-type": "<owl:Thing>",
+ "enhancer:extracted-from":
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+ }
+ ]
+ }
+
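
Sorting the suggestions by confidence, as mentioned above, might look like the following Python sketch. The labels, confidence values and `dc:relation` targets are copied from the example result; the reduced dictionaries and the function name `rank_suggestions` are assumptions made for illustration. Grouping by `dc:relation` relies on each EntityAnnotation pointing to the TextAnnotation it belongs to, as the "John Smith" entries in the example do.

```python
from collections import defaultdict

LONDON = "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>"
JOHN_SMITH = "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>"

# Reduced copies of the EntityAnnotations from the example result.
entity_annotations = [
    {"dc:relation": LONDON, "enhancer:entity-label": "London", "enhancer:confidence": 5147829.5},
    {"dc:relation": JOHN_SMITH, "enhancer:entity-label": "John L. Smith", "enhancer:confidence": 4.471743},
    {"dc:relation": LONDON, "enhancer:entity-label": "London, Ontario", "enhancer:confidence": 2543.5994},
    {"dc:relation": LONDON, "enhancer:entity-label": "City of London", "enhancer:confidence": 7709.837},
    {"dc:relation": JOHN_SMITH, "enhancer:entity-label": "John Maynard Smith", "enhancer:confidence": 15.735652},
    {"dc:relation": JOHN_SMITH, "enhancer:entity-label": "John T. Smith", "enhancer:confidence": 4.4515367},
]


def rank_suggestions(annotations: list[dict]) -> dict[str, list[str]]:
    """Group suggestions by the TextAnnotation they relate to, best match first."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation["dc:relation"]].append(annotation)
    return {
        relation: [
            item["enhancer:entity-label"]
            for item in sorted(items, key=lambda a: a["enhancer:confidence"], reverse=True)
        ]
        for relation, items in grouped.items()
    }


ranked = rank_suggestions(entity_annotations)
```

Note that the confidence values in the example are not normalised to the range 0 to 1, so they are only comparable among the suggestions for the same TextAnnotation.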