Hi Rupert, Stanbol community,
we are happy to finally announce that we have fixed the problem with the
NER engine described in STANBOL-583 and have also added support for
Italian Named Entity Recognition.
I've posted a patch at
https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13285682#comment-13285682
and have also attached it to this mail.
Concerning Rupert's comment on the purpose of the TextAnnotation
added by the Lemmatizer component when the "completeMorphoAnalysis" option
is deactivated:
in that case the component doesn't provide a token-by-token morphological
analysis; instead it returns the lemmatized version of the whole
textual content, replacing each token with its lemma form.
E.g. I'm booking two tickets -> I be book two ticket
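A minimal sketch of this whole-text replacement, for illustration only. The class name, the regex tokenizer and the hand-written lemma table are assumptions made for the example; in the engine the lemmas come from the CELI web service, not from a local map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not the CELI lemmatizer or its client. */
final class LemmatizedTextSketch {

    // Assumed tokenizer: a run of letters, optionally followed by an apostrophe clitic ("I'm").
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+(?:'\\p{L}+)?");

    /** Replaces every token found in the lemma table; unknown tokens are kept unchanged. */
    static String lemmatize(String text, Map<String, String> lemmas) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String lemma = lemmas.getOrDefault(token, token);
            // quoteReplacement guards against '$' or '\' inside a lemma
            m.appendReplacement(out, Matcher.quoteReplacement(lemma));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> lemmas = new LinkedHashMap<>();
        lemmas.put("I'm", "I be");
        lemmas.put("booking", "book");
        lemmas.put("tickets", "ticket");
        // prints "I be book two ticket"
        System.out.println(lemmatize("I'm booking two tickets", lemmas));
    }
}
```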
If you think that this feature is not useful, I could remove it in order
to avoid unnecessary configuration options.
Let me know.
Best,
Alessio
On 05/19/2012 08:19 PM, Rupert Westenthaler wrote:
Hi Alessio, Stanbol community
Before I start, the current state of the things described in this Mail
can be found in the CELI Engine branch
http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
I made good progress on this issue this week. But most of the work was
not directly on the CELI engines but rather on making Stanbol ready
for the new Engines ^^
A lot of small things were not explicitly specified (e.g. language
annotations STANBOL-613; TopicEnhancements STANBOL-617). This is not a
big deal if there is only a single Engine that provides a feature,
but as soon as there are several, one needs to ensure compatibility to
give users more freedom when they configure their EnhancementChains.
These changes should ensure that users can easily use one/several/all
of the CELI Engines and - even more important - combine them with all
the existing Stanbol EnhancementEngines.
In addition I have added a new Utility class that can be used in Unit
tests for EnhancementEngines to validate the created Enhancements (see
STANBOL-612). The new EnhancementStructureHelper class is part of the
"o.a.s.enhancer.test" test module and is by now used by most of
the Stanbol Enhancement engines (including all CELI engines).
In the following I provide an overview of the changes and the
current state of the Engines.
(1) General Changes (valid for all Engines)
* Error Handling: EnhancementEngines MUST NOT catch exceptions that
influence EnhancementResults. Users can configure in EnhancementChains
whether an Engine is optional or required, and the EnhancementJobManager
needs to take care of this. If Engines do catch Exceptions, then the
EnhancementJobManager is missing the required information.
* Read/Write locks: EnhancementEngines that use "ENHANCE_ASYNC" need
to use read and write locks when accessing the ContentItem.
* HTTP clients: I changed the clients so that they no longer create
in-memory copies of the content and the enhancement results. I know
of some users who send PDF documents with 100+ pages to Stanbol, and
for such cases it is good to avoid an in-memory copy of 100 pages of
XML-escaped text.
* fise:selection-context: This property was missing, but it is critical
for re-finding the exact location of a TextAnnotation within
non-plain-text systems (e.g. the http://hallojs.org/annotate.html
demo). As the CELI services do not provide this, I added an
implementation that uses the 50 characters before/after the selected
text to create the context.
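The fise:selection-context computation from the last point could look roughly like the sketch below. The class and method names are hypothetical and the boundary handling (clamping at the start and end of the content) is an assumption; the actual implementation in the branch may differ.

```java
/** Hypothetical helper; not the code used by the CELI engines. */
final class SelectionContextSketch {

    // Number of characters kept before and after the selection (value taken from the mail).
    static final int CONTEXT_CHARS = 50;

    /**
     * Returns the selected text [start, end) together with up to CONTEXT_CHARS
     * characters on each side, clamped to the bounds of the content.
     */
    static String selectionContext(String content, int start, int end) {
        if (start < 0 || end > content.length() || start > end) {
            throw new IllegalArgumentException("invalid selection [" + start + ", " + end + ")");
        }
        int from = Math.max(0, start - CONTEXT_CHARS);
        int to = Math.min(content.length(), end + CONTEXT_CHARS);
        return content.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "Brigitte Bardot was born in Paris in 1934.";
        int start = text.indexOf("Paris");
        // The text is short, so the context is clamped to the whole sentence.
        System.out.println(selectionContext(text, start, start + "Paris".length()));
    }
}
```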
To make my changes easier to understand I added detailed inline NOTES
describing those changes to the CELI classification Engines. For the
other engines those notes are not present.
(2) Language Identification - READY : Annotates the language as
described by STANBOL-613. This even provides a confidence for the
detected language! Could even provide confidences for other languages
(currently not used).
(3) Lemmatizer - FUNCTIONAL :
I do fully understand the "completeMorphoAnalysis" mode. However, I do
not understand what one would use the TextAnnotation for that is added
if "completeMorphoAnalysis" is deactivated.
NOTES
* this engine uses two properties, "fise:hasLemmaForm" and
"fise:hasMorphologicalFeature", and morphological features are encoded
as "{KEY}={VALUE}" (e.g. "GENDER=FEM", "POS=NF", "NUMBER=PLU"). While
this is OK with me for getting things started, it is definitely
something that could be improved on.
* if "completeMorphoAnalysis" is activated, this Engine will create a
fise:TextAnnotation for every single word, resulting in 10 - 15
triples/word. So this Engine might cause trouble for long texts.
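A consumer of the "{KEY}={VALUE}" literals mentioned in the first note would have to split them again. A minimal sketch of that, assuming the values arrive as plain strings; the class and method names are hypothetical and not part of Stanbol.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the "{KEY}={VALUE}" feature encoding. */
final class MorphoFeatureSketch {

    /**
     * Splits each encoded feature at its first '=' into key and value.
     * If a key occurs more than once, the last value wins.
     */
    static Map<String, String> parseFeatures(List<String> encoded) {
        Map<String, String> features = new LinkedHashMap<>();
        for (String feature : encoded) {
            int sep = feature.indexOf('=');
            if (sep <= 0 || sep == feature.length() - 1) {
                throw new IllegalArgumentException("not a KEY=VALUE feature: " + feature);
            }
            features.put(feature.substring(0, sep), feature.substring(sep + 1));
        }
        return features;
    }

    public static void main(String[] args) {
        // Example values taken from the note above.
        System.out.println(parseFeatures(Arrays.asList("GENDER=FEM", "POS=NF", "NUMBER=PLU")));
    }
}
```

The need for such string splitting on the client side is one reason the encoding could be improved on, e.g. by using separate properties per feature.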
(4) NER engine - NOT FUNCTIONAL
* The issues described in the last comment of STANBOL-583 [1] still persist.
If those are solved this engine should be ready to be used.
(5) Classification engine - FUNCTIONAL
I aligned this engine, the Zemanta engine and the topic engine to the
same enhancement model (see STANBOL-617). In order to do that I needed
to change some things:
One "<return>{classification}</return>" as returned by the CELI
service is now mapped to one fise:TopicEnhancement. The "label"
element is used as fise:entity-label of the topic and the
fise:entity-reference is set to the most specific dbpedia ontology
class referenced by the "label" element (see comments in the
ClassificationClientHTTP client for details).
I am not completely sure about those assumptions. So feedback on that
is highly welcome!
(6) TODOs:
I think the main thing is to get rid of the two bugs of the NER
engine. After that I think we can add the CELI engines to the Stanbol
code base.
best
Rupert Westenthaler
[1]
https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13275235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13275235
(why need perma links to jira issues be so long ...)
--
*************************************
Alessio Bosca, Ph.D.
CELI s.r.l.
Via San Quintino 31
10121 Torino
Tel. +39 011.562.71.15
Fax +39 011.506.40.86
http://www.celi.it
*************************************
Index: engines/celi/src/test/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngineTest.java
===================================================================
--- engines/celi/src/test/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngineTest.java (revision 1344258)
+++ engines/celi/src/test/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngineTest.java (working copy)
@@ -40,14 +40,15 @@
private static final ContentItemFactory ciFactory = InMemoryContentItemFactory.getInstance();
- private static final String TEXT = "Brigitte Bardot, née le 28 septembre 1934 à Paris, est une actrice de cinéma et chanteuse française.";
+ private static final String TEXT_it = "Wolfgang Amadeus Mozart, nome di battesimo Joannes Chrysostomus Wolfgangus Theophilus Mozart (Salisburgo, 27 gennaio 1756 – Vienna, 5 dicembre 1791), è stato un compositore, pianista, organista e violinista.";
+ private static final String TEXT_fr = "Brigitte Bardot, née le 28 septembre 1934 à Paris, est une actrice de cinéma et chanteuse française.";
@BeforeClass
public static void setUpServices() throws IOException, ConfigurationException {
Dictionary<String, Object> properties = new Hashtable<String, Object>();
properties.put(EnhancementEngine.PROPERTY_NAME, "celiNer");
properties.put(CeliNamedEntityExtractionEnhancementEngine.SERVICE_URL, "http://linguagrid.org/LSGrid/ws/com.celi-france.linguagrid.namedentityrecognition.v0u0.demo");
- properties.put(CeliNamedEntityExtractionEnhancementEngine.SUPPORTED_LANGUAGES, "fr");
+ properties.put(CeliNamedEntityExtractionEnhancementEngine.SUPPORTED_LANGUAGES, "fr;it");
MockComponentContext context = new MockComponentContext(properties);
nerEngine.activate(context);
}
@@ -60,17 +61,12 @@
public static ContentItem wrapAsContentItem(final String text) throws IOException {
return ciFactory.createContentItem(new StringSource(text));
}
-
- @Test
- public void tesetEngine() throws Exception {
- ContentItem ci = wrapAsContentItem(TEXT);
+
+ private void testInput(String txt,String lang) throws EngineException, IOException{
+ ContentItem ci = wrapAsContentItem(txt);
try {
- //add a simple triple to statically define the language of the test
- //content
- ci.getMetadata().add(new TripleImpl(ci.getUri(), DC_LANGUAGE, new PlainLiteralImpl("fr")));
- //unit test should not depend on each other (if possible)
- //CeliLanguageIdentifierEnhancementEngineTest.addEnanchements(ci);
-
+ //add a simple triple to statically define the language of the test content
+ ci.getMetadata().add(new TripleImpl(ci.getUri(), DC_LANGUAGE, new PlainLiteralImpl(lang)));
nerEngine.computeEnhancements(ci);
TestUtils.logEnhancements(ci);
@@ -79,7 +75,7 @@
expectedValues.put(Properties.ENHANCER_EXTRACTED_FROM, ci.getUri());
expectedValues.put(Properties.DC_CREATOR, LiteralFactory.getInstance().createTypedLiteral(
nerEngine.getClass().getName()));
- int textAnnoNum = validateAllTextAnnotations(ci.getMetadata(), TEXT, expectedValues);
+ int textAnnoNum = validateAllTextAnnotations(ci.getMetadata(), txt, expectedValues);
log.info(textAnnoNum + " TextAnnotations found ...");
int entityAnnoNum = EnhancementStructureHelper.validateAllEntityAnnotations(ci.getMetadata(),expectedValues);
log.info(entityAnnoNum + " EntityAnnotations found ...");
@@ -90,6 +86,12 @@
}
throw e;
}
+ }
+
+ @Test
+ public void tesetEngine() throws Exception {
+ this.testInput(CeliNamedEntityExtractionEnhancementEngineTest.TEXT_it, "it");
+ this.testInput(CeliNamedEntityExtractionEnhancementEngineTest.TEXT_fr, "fr");
}
// private int checkAllEntityAnnotations(MGraph g) {
Index: engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/NERserviceClientHTTP.java
===================================================================
--- engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/NERserviceClientHTTP.java (revision 1344258)
+++ engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/NERserviceClientHTTP.java (working copy)
@@ -41,14 +41,14 @@
* The XML version, encoding; SOAP envelope, heder and starting element of the body;
* processTextRequest and text starting element.
*/
- private static final String REQUEST_PREFIX = "<?xml version=\"1.0\" encoding=\""+UTF8.name()+"\"?>" +
+ private static final String SOAP_PREFIX = "<?xml version=\"1.0\" encoding=\""+UTF8.name()+"\"?>" +
"<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" " +
"xmlns:v0u0=\"http://linguagrid.org/ns/namedentityrecognition/v0u0\"><soapenv:Header/>" +
- "<soapenv:Body><v0u0:processTextRequest><v0u0:text>";
+ "<soapenv:Body>";
/**
* closes the text, processTextRequest, SOAP body and envelope
*/
- private static final String REQUEST_SUFFIX = "</v0u0:text></v0u0:processTextRequest></soapenv:Body></soapenv:Envelope>";
+ private static final String SOAP_SUFFIX = "</soapenv:Body></soapenv:Envelope>";
private final URL serviceEP;
private final String licenseKey;
@@ -70,7 +70,7 @@
}
- public List<NamedEntity> extractEntities(String text) throws SOAPException, IOException {
+ public List<NamedEntity> extractEntities(String text, String lang) throws SOAPException, IOException {
if(text == null || text.isEmpty()){
//no text -> no extractions
return Collections.emptyList();
@@ -80,9 +80,11 @@
HttpURLConnection con = Utils.createPostRequest(serviceEP, requestHeaders);
//write content
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(con.getOutputStream(),UTF8));
- writer.write(REQUEST_PREFIX);
+ writer.write(SOAP_PREFIX);
+ writer.write("<v0u0:processTextRequest><v0u0:text>");
StringEscapeUtils.escapeXml(writer, text);
- writer.write(REQUEST_SUFFIX);
+ writer.write("</v0u0:text><v0u0:language>"+lang+"</v0u0:language></v0u0:processTextRequest>");
+ writer.write(SOAP_SUFFIX);
writer.close();
//Call the service
long start = System.currentTimeMillis();
Index: engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngine.java
===================================================================
--- engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngine.java (revision 1344258)
+++ engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/ner/impl/CeliNamedEntityExtractionEnhancementEngine.java (working copy)
@@ -75,12 +75,12 @@
private static Map<String, UriRef> entityTypes = new HashMap<String, UriRef>();
static {
entityTypes.put("pers", OntologicalClasses.DBPEDIA_PERSON);
+ entityTypes.put("PER", OntologicalClasses.DBPEDIA_PERSON);
entityTypes.put("loc", OntologicalClasses.DBPEDIA_PLACE);
+ entityTypes.put("GPE", OntologicalClasses.DBPEDIA_PLACE);
entityTypes.put("org", OntologicalClasses.DBPEDIA_ORGANISATION);
entityTypes.put("time", OntologicalClasses.SKOS_CONCEPT);
- entityTypes.put("prod", OntologicalClasses.SKOS_CONCEPT);
- entityTypes.put("amount", OntologicalClasses.SKOS_CONCEPT);
}
/**
* The supported languages (configured via the {@link #SUPPORTED_LANGUAGES}
@@ -229,7 +229,7 @@
}
Language lang = new Language(language); //used for the palin literals in TextAnnotations
try {
- List<NamedEntity> lista = this.client.extractEntities(text);
+ List<NamedEntity> lista = this.client.extractEntities(text, language);
LiteralFactory literalFactory = LiteralFactory.getInstance();
MGraph g = ci.getMetadata();
@@ -269,7 +269,7 @@
private Resource getEntityRefForType(String type) {
if (!entityTypes.containsKey(type))
- return null;
+ return OntologicalClasses.SKOS_CONCEPT;
else
return entityTypes.get(type);
}
Index: engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/lemmatizer/impl/LemmatizerClientHTTP.java
===================================================================
--- engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/lemmatizer/impl/LemmatizerClientHTTP.java (revision 1344258)
+++ engines/celi/src/main/java/org/apache/stanbol/enhancer/engines/celi/lemmatizer/impl/LemmatizerClientHTTP.java (working copy)
@@ -72,7 +72,7 @@
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(con.getOutputStream(),UTF8));
//write the SOAP envelope, header and start the body
writer.write(SOAP_REQUEST_PREFIX);
- //wrtie the data (language and text)
+ //write the data (language and text)
writer.write("<mor:inputText lang=\"");
writer.write(lang);
writer.write("\" text=\"");