Author: agruber
Date: Thu Oct  6 08:03:12 2011
New Revision: 1179530

URL: http://svn.apache.org/viewvc?rev=1179530&view=rev
Log:
Add documentation on several engines: langID, Metaxa, NER, NET and multilingual 
case description

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext?rev=1179530&r1=1179529&r2=1179530&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext 
(original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/engines.mdtext Thu 
Oct  6 08:03:12 2011
@@ -2,17 +2,18 @@ Title: Enhancement Engines and their mai
 
 ## Preprocessing
 
-- __Language Identification Engine__
-       - langage dedection for textual content using Apache Tika
+- __[Language Identification Engine](enhancer/engines/langidengine.html)__
+       - language detection for textual content utilizing [Apache 
Tika](http://tika.apache.org/)
        
 
-- __Metaxa Engine__
-       - text extraction from various documents
-       - extraction of metadata from documents
+- __[Metaxa Engine](enhancer/engines/metaxaengine.html)__
+       - text extraction from various document formats
+       - extraction of metadata from document formats
+       -
        
 ## Natural Language Processing
 
-- __Named Entity Extraction Enhancement Engine__ 
+- __[Named Entity Extraction Enhancement 
Engine](enhancer/engines/namedentityextractionengine.html)__ 
        - NLP processing using OpenNLP NER
        - detect occurrences of persons, places and organizations only
        
@@ -21,6 +22,7 @@ Title: Enhancement Engines and their mai
        - NLP processing using OpenNLP
        - supports multiple languages
        - dedect occurences of untyped entities as concepts, takes local 
taxonomies as linking target
+
        
 - _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
        - NLP processing using OpenNLP POS
@@ -29,7 +31,7 @@ Title: Enhancement Engines and their mai
 
 ## Linking Suggestions
 
-- __Named Entity Tagging Engine__
+- __[Named Entity Tagging 
Engine](enhancer/engines/namedentitytaggingengine.html)__
        - suggest links to several Linked Data Sources (e.g. dbpedia)
 
 - __Location Enhancement Engine__ 
@@ -45,7 +47,7 @@ Title: Enhancement Engines and their mai
 
 ## Postprocessing / Other
 
-- __CachingDereferencerEngine__ 
+- _CachingDereferencerEngine_ (deprecated, see dereferencing support of 
individual engines as well as  
[STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
        - retrieves additional content for presenting the enhancement results.
        
 - __Refactor Engine__

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext?rev=1179530&r1=1179529&r2=1179530&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext 
(original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/index.mdtext Thu 
Oct  6 08:03:12 2011
@@ -49,13 +49,17 @@ The web interface of your Apache Stanbol
 
 ## Usage Scenarios for Apache Stanbol
 
-* [Content Enhancement](contentenhancement.html)
+* [Basic Content Enhancement](contentenhancement.html)
 
  Analyze textual content, enhance with with named entities (person, place, 
organization), suggest links to open data sources.
 
 * [Working with "local" Entities](customvocabulary.html)
 
- Use locally defined entities (e.g. thesaurus concepts) from an organization's 
context.  
+ Use locally defined entities (e.g. thesaurus concepts) from an organization's 
context.
+
+* [Working with multiple languages](multilingual.html)
+ 
+ Get enhancements for textual content in multiple languages (EN, DE, SV, DA, 
PT and NL).
 
 * Semantic Search in Portals
 
@@ -66,7 +70,7 @@ The web interface of your Apache Stanbol
  Refactor the enhancement result, its property names and ontology types 
according your target ontology.
 
 * Transforming CMS repository structures into ontologies
- Provide repository structures as thesaurus or domain ontology, e.g. 
categories.
+* Provide repository structures as thesaurus or domain ontology, e.g. 
categories.
 
 
 ## Technical Documentation

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext 
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/langidengine.mdtext 
Thu Oct  6 08:03:12 2011
@@ -0,0 +1,80 @@
+Title: The Language Identification Engine: detect the language of a text
+
+The **LangId** engine determines the language of text.
+
+## Technical Description
+
+The provided engine is based on the language identifier of [Apache 
Tika](http://tika.apache.org/).
+The text to be checked must be provided in plain text format in one of two 
forms:
+
+* a plain text content item
+* by the content item's metadata as the string value of the property 
+    
+    
<pre><code>http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent</code></pre>
+
+The result of language identification is added as TextAnnotation to the 
content item's metadata as string value of the property
+
+    http://purl.org/dc/terms/language
+
+This RDF snippet illustrates the output:
+
+    <fise:TextAnnotation 
rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
+        <dc:language>en</dc:language>
+        
<dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
+    </fise:TextAnnotation>
+
+
+By default the language identifier distinguishes the languages listed below. 
After the colon the value of the language label in the metadata is given.
+
+* German: de
+* English: en
+* Estonian: et
+* French: fr
+* Spanish: es
+* Italian: it
+* Swedish: sv
+* Polish: pl
+* Dutch: nl
+* Norwegian: no
+* Finnish: fi
+* Greek: el
+* Danish: da
+* Hungarian: hu
+* Icelandic: is
+* Lithuanian: lt
+* Portuguese: pt
+* Russian: ru
+* Thai: th
+
+Additional language models can be created as Tika 
[LanguageProfile](org.apache.tika.language.LanguageProfile).
+
+## Configuration options
+
+* <code>org.apache.stanbol.enhancer.engines.langid.probe-length</code>
+
+    an integer specifying how many characters will be used for
+    identification. A value of 0 or below means to use the complete
+    text. Otherwise only a substring of the specified length taken from the
+    middle of the text will be used. The default value is 400 characters.
+
+## Usage
+
+Assuming that the Stanbol endpoint with the full launcher is running at
+
+    http://localhost:8080
+
+and the engine is activated, from the command line commands like this
+can be used for submitting some text file as content item:
+
+* stateless interface
+
+    curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt 
http://localhost:8080/engines
+
+* stateful interface
+
+    curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt 
http://localhost:8080/contenthub/content/someFileId
+
+ Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at
+
+    http://localhost:8080/contenthub
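
The `probe-length` rule described in langidengine.mdtext above can be illustrated with a minimal Python sketch. This is not the engine's code: the function name `select_probe`, the integer rounding of the middle offset, and the handling of texts shorter than the probe length are assumptions made for illustration. Only the rule itself (0 or below means the complete text, otherwise a substring of the given length taken from the middle, default 400) comes from the documentation.

```python
def select_probe(text: str, probe_length: int = 400) -> str:
    """Return the part of `text` that would be used for language identification.

    Hypothetical helper that mirrors the documented rule, not the engine's code.
    """
    # A value of 0 or below means: use the complete text.
    # Texts shorter than the probe length are also returned unchanged
    # (an assumption of this sketch).
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Otherwise take a window of probe_length characters centred in the text.
    # Integer division for the start offset is an assumption of this sketch.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]


if __name__ == "__main__":
    sample = "x" * 300 + "The quick brown fox" + "y" * 300
    print(len(select_probe(sample)))     # length of the default window
    print(len(select_probe(sample, 0)))  # length of the complete text
```

Taking the probe from the middle avoids headers and footers, which often contain boilerplate in a different language than the body.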

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext 
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/metaxaengine.mdtext 
Thu Oct  6 08:03:12 2011
@@ -0,0 +1,281 @@
+Title: The Metaxa Enhancement Engine: extracting content and metadata from 
various formats
+
+The **Metaxa Enhancement Engine** extracts embedded metadata and textual 
content from a large variety of document types and formats. The text extraction 
functionality also makes Metaxa suitable as a pre-processor for other 
components, especially NLP processors and indexing for search.
+
+## Technical description
+
+The engine is based on the [Aperture
+framework](http://aperture.sourceforge.net/) with new extensions for handling 
structured content embedded in HTML web content, such as 
[Microformats](http://microformats.org/) and 
[RDFa](http://www.w3.org/TR/rdfa-syntax/).
+In addition, some of Aperture's original extractors were replaced by others 
that use different base libraries.
+Metaxa introduces a single TextAnnotation instance that refers to the content 
item by its *extracted-from* property. The specific metadata extracted by 
Metaxa are attached directly to the content item/document, since they represent
+document properties and not text annotations. Different ontologies are used 
to describe the different types of metadata; an overview is given below.
+
+The general structure of the Metaxa annotations consists of three levels of 
annotations illustrated in the following example:
+
+#### The top-level <tt>TextAnnotation</tt> instance
+
+    <urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b>
+         a       <http://fise.iks-project.eu/ontology/TextAnnotation> ,
+                        <http://fise.iks-project.eu/ontology/Enhancement> ;
+                 <http://fise.iks-project.eu/ontology/confidence>
+                     "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
+         <http://fise.iks-project.eu/ontology/extracted-from>
+                 <http://localhost:8080/store/content/mf_example.htm> ;
+         <http://purl.org/dc/terms/created>
+                 
"2010-09-22T09:06:53.056+02:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
+         <http://purl.org/dc/terms/creator>
+                  
"org.apache.enhancer.engines.metaxa.MetaxaEngine"^^<http://www.w3.org/2001/XMLSchema#string>
 .
+
+
+#### The top-level document metadata, referenced from the 
<tt>TextAnnotation</tt> instance via the *extracted-from* property:
+
+    <http://localhost:8080/store/content/mf_example.htm>
+         a       
<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument> ;
+         <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains>
+                 <urn:rnd:-9e25553:12b3843df43:-7ffe> ;
+         <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description>
+                 "Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las 
Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World. 
Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL 
Bonded." ;
+         <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword>
+                 "travel" , "bargain flights" , "late deals" , "hotels" , "air 
tickets" , "air fares" , "discount travel" , "last minute flights" , "cheap 
airlines" , "cheap holidays" , "cheap flights" , "flightline" , "hotel 
reservations" , "discount flights" , "air travel" , "package holidays" ;
+         
<http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent>
+                 "More Than Just Cheap Flights ..." ;
+         <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>
+                 "Flightline | Cheap Flights, Package Holidays, Hotels, Travel 
Insurance &amp; More" .
+
+#### Embedded <tt>hCard</tt> microformat data referenced via the 
<tt>nie:contains</tt> property:
+
+
+    <urn:rnd:-9e25553:12b3843df43:-7ffe>
+         a       <http://www.w3.org/2006/vcard/ns#VCard> ;
+         <http://www.w3.org/2006/vcard/ns#adr>
+               <urn:rnd:-9e25553:12b3843df43:-7ffc> ;
+         <http://www.w3.org/2006/vcard/ns#fn>
+               "Flightgeoline Essex Limited" ;
+         <http://www.w3.org/2006/vcard/ns#geo>
+               <urn:rnd:-9e25553:12b3843df43:-7ffb> ;
+        <http://www.w3.org/2006/vcard/ns#org>
+               <urn:rnd:-9e25553:12b3843df43:-7ffd> ;
+        <http://www.w3.org/2006/vcard/ns#photo>
+               
<https://www.flightline.co.uk/common/images/building_banner_sm.jpg> ;
+        <http://www.w3.org/2006/vcard/ns#url>
+               <http://www.flightline.co.uk> ;
+        <http://www.w3.org/2006/vcard/ns#workTel>
+               <tel:0800541541> .
+
+    <urn:rnd:-9e25553:12b3843df43:-7ffd>
+         a       <http://www.w3.org/2006/vcard/ns#Organization> ;
+         <http://www.w3.org/2006/vcard/ns#organization-name>
+               "Flightline Essex Limited" .
+
+    <urn:rnd:-9e25553:12b3843df43:-7ffc>
+         a       <http://www.w3.org/2006/vcard/ns#Address> ;
+         <http://www.w3.org/2006/vcard/ns#countryName>
+               "UK" ;
+         <http://www.w3.org/2006/vcard/ns#extendedAddress>
+              "Flightline House" ;
+         <http://www.w3.org/2006/vcard/ns#locality>
+              "Westcliff-on-Sea" ;
+         <http://www.w3.org/2006/vcard/ns#postalCode>
+              "SS0 7JE" ;
+         <http://www.w3.org/2006/vcard/ns#region>
+              "Essex" ;
+         <http://www.w3.org/2006/vcard/ns#streetAddress>
+              "32-38 Milton Road" .
+
+    <urn:rnd:-9e25553:12b3843df43:-7ffb>
+         a       <http://www.w3.org/2006/vcard/ns#Location> ;
+         <http://www.w3.org/2006/vcard/ns#latitude>
+              "51.53894902845868" ;
+         <http://www.w3.org/2006/vcard/ns#longitude>
+              "0.700753927230835" .
+
+
+
+### Supported document types
+
+The set of extraction engines for specific document types is defined by the 
resource *extractionregistry.xml*. Each engine specifies what MIME types it can 
handle. By default the extraction registry provides extractors for the
+following set of document formats:
+
+* *Office documents*:
+ *   MS-Works
+ *   MS-Office
+ *   Excel
+ *   PowerPoint
+ *   Word
+ *   Visio
+ *   OpenDocument
+ *   OpenXml
+ *   Publisher
+ *   Corel-Presentations
+ *   QuattroPro
+ *   WordPerfect
+
+* *Multimedia documents*:
+ *    JPG
+ *    MP3
+
+* *(X)HTML*, supporting also these types of embedded structures/microformats, 
as defined by the default resource *htmlextractors.xml*:
+ *    RDFa
+ *    geo
+ *    hAtom
+ *    hCal
+ *    hCard
+ *    hReview
+ *    rel-license
+ *    rel-tag
+ *    xFolk
+
+* *Other*:
+ *    PDF
+ *    RTF
+ *    Plain Text
+ *    XML
+  
+### Textual Content
+
+Metaxa represents the plain text content of a document in the content item's 
metadata as the value of the property:
+
+    http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
+
+### Vocabularies
+
+Metaxa uses a set of vocabularies ("ontologies") for structured data 
representation.
+
+#### Aperture Core Ontologies
+
+These ontologies belong to the underlying Aperture subsystem, contained in the
+package
+
+    org.semanticdesktop.aperture.vocabulary
+
+The most important ones with respect to top-level document properties are
+
+* NIE (Nepomuk Information Element):
+
+    http://www.semanticdesktop.org/ontologies/2007/01/19/nie#
+
+* NFO (Nepomuk File Object):
+
+    http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#
+
+Documentation of Aperture's core ontologies is provided in Aperture's Javadoc 
[http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html](http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html)
 for the packages in 
+
+    org.semanticdesktop.aperture.vocabulary.
+
+#### HTML Microformat Extractors
+
+The following table describes which vocabularies are used for representing 
microformat data in Metaxa: 
+
+
+<table border="1">
+    <tr>
+        <th>MF</th>
+        <th>Vocabulary (Namespace)</th>
+    </tr>
+    <tr>
+        <td>geo</td>
+        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hAtom</td>
+        <td>atom (<tt>http://www.w3.org/2005/Atom#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tagging 
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hCal</td>
+        <td> ical (<tt>http://www.w3.org/2002/12/cal/icaltzd#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hCard</td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hReview</td>
+        <td>review (<tt>http://www.purl.org/stuff/rev#</tt>)</td></tr>
+    <tr>
+        <td></td>
+        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dcterms (<tt>http://purl.org/dc/dcmitype/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>foaf (<tt>http://xmlns.com/foaf/0.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tag (<tt>http://www.holygoat.co.uk/owl/redwood/0.1/tags/</tt>)</td>
+    </tr>
+    <tr>
+        <td>rel-license</td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td>rel-tag</td>
+        <td> tagging 
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+    <tr>
+        <td>xFolk</td>
+        <td>nfo 
(<tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tagging 
(<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+</table>
+
+## Configuration options
+
+By default, Metaxa uses the extractors specified in the resource 
"extractionregistry.xml" and, for HTML pages, the resource "htmlextractors.xml".
+Alternative configurations and extractors can be attached to Metaxa as 
fragment bundles, specifying as host bundle
+
+    Fragment-Host: org.apache.stanbol.enhancer.engines.metaxa
+
+The alternative configuration files then can be set as values of the properties
+
+* 
<pre><code>org.apache.stanbol.enhancer.engines.metaxa.extractionregistry</code></pre>
+
+* 
<pre><code>org.apache.stanbol.enhancer.engines.metaxa.htmlextractors</code></pre>
+
+## Usage
+
+Assuming that the Stanbol endpoint with the full launcher is running at
+
+    http://localhost:8080
+
+and the engine is activated, from the command line commands like this can be 
used for submitting some file as content item, where the mime type must match 
the document type:
+
+* stateless interface
+
+    curl -i -X POST -H "Content-Type:text/html" -T testpage.html 
http://localhost:8080/engines
+
+* stateful interface
+
+    curl -i -X PUT -H "Content-Type:text/html" -T testpage.html 
http://localhost:8080/contenthub/content/someFileId
+
+ Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at
+
+    http://localhost:8080/contenthub
+
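
The stateless `curl` call from the Usage section of metaxaengine.mdtext above can be mirrored in Python, including the requirement that the MIME type match the document type. The following is a minimal sketch, not part of Stanbol: the function name `build_stateless_request` is hypothetical, and guessing the MIME type from the file extension with the standard `mimetypes` module is an assumption of this example. The endpoint `http://localhost:8080/engines` is the one given in the documentation.

```python
import mimetypes
import urllib.request

# Endpoint assumed in the Usage section (full launcher running locally).
STANBOL = "http://localhost:8080"


def build_stateless_request(filename: str, content: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST to the stateless /engines endpoint.

    Hypothetical helper: the Content-Type is guessed from the file extension
    so that it matches the document type, as the documentation requires.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot guess MIME type for {filename!r}")
    return urllib.request.Request(
        url=f"{STANBOL}/engines",
        data=content,
        method="POST",
        headers={"Content-Type": mime_type},
    )


if __name__ == "__main__":
    req = build_stateless_request("testpage.html", b"<html><body>Hello</body></html>")
    print(req.get_method(), req.full_url, req.get_header("Content-type"))
    # To submit the request to a running instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the headers without a running Stanbol instance.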

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext?rev=1179530&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext 
(added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext 
Thu Oct  6 08:03:12 2011
@@ -0,0 +1,64 @@
+Title: Configure Apache Stanbol to work with multiple languages
+
+
+The following languages are supported -
+
+- English
+- German
+- Danish
+- Swedish
+- Dutch
+- Portuguese
+
+
+##Configuration steps
+
+- Have language labels in your target data and install the index
+- Activate the LangIdEnhancementEngine and the KeywordLinkingEngine
+- Add language models to your Stanbol instance
+- Configure the KeywordLinkingEngine
+
+
+###Install your index
+
+In case you want to use an index of your custom vocabulary, first [create an 
index](customvocabulary.html) out of it and then add the index to your Stanbol 
instance. Simply copy the <code>{yourindex}.solr.zip</code> into your 
<code>{stanbol-root}/sling/datafiles</code> directory and install the 
respective OSGi bundle in your OSGi admin console.
+
+Make sure that this index contains language labels in all languages you want 
to work with and that they are properly indexed.
+
+###Build and add the necessary language bundles
+
+To build the language bundles go to "{stanbol-root}/data/" and call
+
+    mvn clean install -P opennlp
+
+This enables the profile to build the OpenNLP models for all languages.
+
+After this the bundles are available in the folder
+
+    {stanbol-root}/data/opennlp/lang/{language}/target
+
+The naming of the bundles is 
"org.apache.stanbol.data.opennlp.lang.{language}-*.jar".
+
+Add the bundles via the OSGi admin console in the Bundles tab. The language 
bundles will fetch and install the corresponding 
[OpenNLP](http://dev.iks-project.eu/downloads/opennlp/models-1.5/) models for 
the languages you want to use.
+
+OpenNLP provides language support
+
+
+
+###Activate the LangID engine and the KeywordLinkingEngine
+
+Go to the admin console and deactivate some of the available engines. 
In particular, the standard NER engine and the Entity Linking Engines should be 
deactivated, as they do not support multiple languages. At least two engines 
need to be activated:
+
+- The [Language Identification Engine](enhancer/engines/langidengine.html) 
provides the language of the text you want to enhance; it creates a 
dc:language property.
+- The [Keyword Linking Engine](enhancer/engines/keywordlinkingengine.html) 
+
+
+
+###Configure the KeywordLinkingEngine
+
+(TODO)
+
+
+##Examples
+
+(TODO)
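
The bundle location and naming scheme given in multilingual.mdtext above can be used to collect the built language bundles before installing them. This is a minimal sketch under stated assumptions: the function name `find_language_bundles` is hypothetical, and it assumes that `{language}` in the path is a lowercase language code such as `de` or `sv`. The directory layout and the jar name pattern are taken from the documentation.

```python
from __future__ import annotations

from pathlib import Path


def find_language_bundles(stanbol_root: Path, languages: list[str]) -> dict[str, list[Path]]:
    """Map each language code to the bundle jars found in its target folder.

    Hypothetical helper; languages without a built bundle map to an empty list.
    """
    found: dict[str, list[Path]] = {}
    for lang in languages:
        # Documented location: {stanbol-root}/data/opennlp/lang/{language}/target
        target = stanbol_root / "data" / "opennlp" / "lang" / lang / "target"
        # Documented naming: org.apache.stanbol.data.opennlp.lang.{language}-*.jar
        pattern = f"org.apache.stanbol.data.opennlp.lang.{lang}-*.jar"
        found[lang] = sorted(target.glob(pattern)) if target.is_dir() else []
    return found


if __name__ == "__main__":
    # Language codes assumed for the six languages listed in the documentation.
    bundles = find_language_bundles(Path("."), ["en", "de", "da", "sv", "nl", "pt"])
    for lang, jars in bundles.items():
        print(lang, [jar.name for jar in jars])
```

An empty list for a language indicates that the `opennlp` Maven profile has not been built for it yet.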

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext?rev=1179530&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentityextractionengine.mdtext
 Thu Oct  6 08:03:12 2011
@@ -0,0 +1,29 @@
+Title: The Named Entity Recognition Engine: detect Named Entities from 
unstructured text content 
+
+This engine is based on the NLP features of [Apache OpenNLP 
(incubating)](http://incubator.apache.org/opennlp/). It uses OpenNLP's Maximum 
Entropy models to detect persons, places and organizations.
+
+(TODO: features, configuration if possible)
+
+
+## Example Result
+
+For the text "John Smith lives in London", this engine adds 
**TextAnnotation enhancements** to the enhancement graph. Among others, it adds 
the following one, which suggests the type Place for the string "London":
+
+    {
+      "@subject": "<urn:enhancement-e6a08398-a49f-5bf6-c09f-6da5db63507e>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:TextAnnotation>"
+      ],
+      "dc:created": "2011-10-04T12:36:50.670Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore",
+      "dc:type": "<dbp-ont:Place>",
+      "enhancer:confidence": 0.99691045,
+      "enhancer:end": 26,
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>",
+      "enhancer:selected-text": "London",
+      "enhancer:selection-context": "John Smith lives in London",
+      "enhancer:start": 20
+    }
+
+This enhancement statement provides you with the ID and date of the 
enhancement, the suggested type with a confidence for it, the position of the 
selected text and its (sentence) context as well as the link to the source 
document.
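
The position fields of the example result in namedentityextractionengine.mdtext above can be checked against the text with a few lines of Python. This is a minimal sketch: the function name `selected_text_matches` is hypothetical, and it assumes that `enhancer:start` and `enhancer:end` are character offsets into the content with an exclusive end, which is consistent with the values in the example (20 and 26 for "London").

```python
def selected_text_matches(content: str, annotation: dict) -> bool:
    """Check that the start/end offsets of a TextAnnotation select its text.

    Hypothetical helper; assumes character offsets with an exclusive end.
    """
    start = annotation["enhancer:start"]
    end = annotation["enhancer:end"]
    return content[start:end] == annotation["enhancer:selected-text"]


# Fields copied from the example result above.
london = {
    "dc:type": "<dbp-ont:Place>",
    "enhancer:start": 20,
    "enhancer:end": 26,
    "enhancer:selected-text": "London",
    "enhancer:selection-context": "John Smith lives in London",
}

if __name__ == "__main__":
    print(selected_text_matches("John Smith lives in London", london))
```

A check like this is useful when highlighting annotations in the original text, since shifted offsets would mark the wrong span.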

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext?rev=1179530&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/namedentitytaggingengine.mdtext
 Thu Oct  6 08:03:12 2011
@@ -0,0 +1,165 @@
+Title: The Named Entity Tagging Engine: linking text annotations to (external) 
datasets of entities
+
+The Entity Linking Engine uses *[Referenced Sites](../../entityhub.html)* to 
search for Entities based on given Text Annotations.
+
+## Configuration
+
+The configuration determines which dataset is used as the linking target. 
The default value is "local", which refers to the default DBpedia index. You can 
also decide whether the given types should restrict the set of possible links. 
For example, some organisations in DBpedia are not typed as such, so with a type 
restriction this engine will not return them even though you expect them in your dataset.
+
+- Referenced Site: {local, your referenced site}
+
+ *The ID of the Entityhub Referenced Site used for semantic lifting of 
TextAnnotations.* 
+
+- Persons: {true, false}
+ 
+ *Set to TRUE to enable semantic lifting of Persons*
+
+- Person Type {<empty>, dbp-ont:Person}        
+
+ *The rdf:type used to search for Persons. If empty Entities of any type are 
accepted.*
+
+- Organisations        {true, false}
+
+ *Set to TRUE to enable semantic lifting of Organisations*
+
+- Organisation Type    {<empty>, dbp-ont:Organisation}
+
+  *The rdf:type used to search for Organizations. If empty Entities of any 
type are accepted.*
+
+- Places {true, false}
+
+ *Set to TRUE to enable semantic lifting of Places*
+
+- Place Type {<empty>, dbp-ont:Place}  
+
+ *The rdf:type used to search for Places. If empty Entities of any type are 
accepted.*
+
+- Label Field {<empty>, rdfs:label}
+
+ *The field used to search for Entities with a label similar to the selected 
text of the Text Annotation. If empty rdfs:label is used as default*
+
+
+## Example Result
+
+For the sentence "John Smith lives in London", you will get several 
EntityAnnotations for the terms "London" and "John Smith" from your linking target 
resource (in this case DBpedia) together with a confidence value, which can be 
used to sort the suggestions.
+
+    {
+      "@subject": "<urn:enhancement-2ec0662c-3a10-f8f5-43b4-cf7403e4c39d>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:04.175Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+      "enhancer:confidence": 5147829.5,
+      "enhancer:entity-label": "\"London\"@en",
+      "enhancer:entity-reference": "<http://dbpedia.org/resource/London>",
+      "enhancer:entity-type": "<owl:Thing>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-44ccea73-639d-394a-8660-fad46795a772>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:06.809Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+      "enhancer:confidence": 4.471743,
+      "enhancer:entity-label": "\"John L. Smith\"@en",
+      "enhancer:entity-reference": 
"<http://dbpedia.org/resource/John_L._Smith>",
+      "enhancer:entity-type": "<dbp-ont:CollegeCoach>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:TextAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:44:52.318Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore",
+      "dc:type": "<dbp-ont:Person>",
+      "enhancer:confidence": 0.66891855,
+      "enhancer:end": 10,
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>",
+      "enhancer:selected-text": "John Smith",
+      "enhancer:selection-context": "John Smith lives in London",
+      "enhancer:start": 0
+    },
+    {
+      "@subject": "<urn:enhancement-708bfdae-c104-19bd-c423-f5c10a11ae55>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:04.216Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+      "enhancer:confidence": 2543.5994,
+      "enhancer:entity-label": "\"London, Ontario\"@en",
+      "enhancer:entity-reference": 
"<http://dbpedia.org/resource/London,_Ontario>",
+      "enhancer:entity-type": "<http://www.opengis.net/gml/_Feature>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-73dce2ac-72b6-b0f4-7c5c-e9c30aec9263>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:04.216Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c>",
+      "enhancer:confidence": 7709.837,
+      "enhancer:entity-label": "\"City of London\"@en",
+      "enhancer:entity-reference": 
"<http://dbpedia.org/resource/City_of_London>",
+      "enhancer:entity-type": "<http://www.opengis.net/gml/_Feature>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-c428cb67-cdce-4396-96b8-ac3a8465730a>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:TextAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:44:39.064Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
+      "dc:language": "\"fi\"",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-c6ffb5f4-a224-9b7d-9854-7eaa101b2ebe>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:06.809Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+      "enhancer:confidence": 15.735652,
+      "enhancer:entity-label": "\"John Maynard Smith\"@en",
+      "enhancer:entity-reference": 
"<http://dbpedia.org/resource/John_Maynard_Smith>",
+      "enhancer:entity-type": "<dbp-ont:Scientist>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    },
+    {
+      "@subject": "<urn:enhancement-eeaf0331-5988-5231-493c-f934a2602200>",
+      "@type": [
+        "<enhancer:Enhancement>",
+        "<enhancer:EntityAnnotation>"
+      ],
+      "dc:created": "2011-10-06T07:45:06.809Z",
+      "dc:creator": 
"org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
+      "dc:relation": "<urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295>",
+      "enhancer:confidence": 4.4515367,
+      "enhancer:entity-label": "\"John T. Smith\"@en",
+      "enhancer:entity-reference": 
"<http://dbpedia.org/resource/John_T._Smith>",
+      "enhancer:entity-type": "<owl:Thing>",
+      "enhancer:extracted-from": 
"<urn:content-item-sha1-ea97a3171fe123b27b02497f6eb08b2fca63e6ec>"
+    }
+    ]
+   }
+
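
The example result in namedentitytaggingengine.mdtext above notes that the confidence value can be used to sort the suggestions. The following minimal Python sketch groups EntityAnnotations by the TextAnnotation they refer to (`dc:relation`) and orders each group by descending confidence. The function name `rank_suggestions` is hypothetical, the annotations are reduced to three fields, and the labels are simplified to plain strings. The relation IDs, labels and confidence values are copied from the example.

```python
from collections import defaultdict

# Reduced EntityAnnotations from the example result (labels simplified).
entity_annotations = [
    {"dc:relation": "urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c",
     "enhancer:entity-label": "London", "enhancer:confidence": 5147829.5},
    {"dc:relation": "urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295",
     "enhancer:entity-label": "John L. Smith", "enhancer:confidence": 4.471743},
    {"dc:relation": "urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c",
     "enhancer:entity-label": "London, Ontario", "enhancer:confidence": 2543.5994},
    {"dc:relation": "urn:enhancement-0218c6fa-7376-8c9f-c4ed-e973ff72194c",
     "enhancer:entity-label": "City of London", "enhancer:confidence": 7709.837},
    {"dc:relation": "urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295",
     "enhancer:entity-label": "John Maynard Smith", "enhancer:confidence": 15.735652},
    {"dc:relation": "urn:enhancement-4b7b010e-efcc-8752-f055-b73620270295",
     "enhancer:entity-label": "John T. Smith", "enhancer:confidence": 4.4515367},
]


def rank_suggestions(annotations: list) -> dict:
    """Group suggestions by related TextAnnotation, best confidence first."""
    groups = defaultdict(list)
    for annotation in annotations:
        groups[annotation["dc:relation"]].append(annotation)
    return {
        relation: sorted(items, key=lambda a: a["enhancer:confidence"], reverse=True)
        for relation, items in groups.items()
    }


if __name__ == "__main__":
    for relation, items in rank_suggestions(entity_annotations).items():
        print(relation, [a["enhancer:entity-label"] for a in items])
```

The confidence values in the example are not normalised to a common range, so they are only compared within one group here, not across groups.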

