Thank you Rupert!
It is probably something that I missed.
Best,
Srecko
-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Thursday, January 12, 2012 20:16
To: Srecko Joksimovic; [email protected]
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology
Hi Srecko
It seems that both cases are related to the Metaxa Engine. My knowledge about
the libs used by this engine to extract the textual content is very limited,
so I might not be the right person to look into that.
In the first example I think Metaxa was not able to extract the text from
the Word document, because the only plainTextContent triple noted is
<j.0:plainTextContent>Microsoft Word-Dokument
srecko</j.0:plainTextContent>
The second example looks like an issue within the RDF metadata generation
in Aperture.
I sent this reply also directly to Walter Kasper. He is the one who
contributed this engine and should be able to provide more information.
best
Rupert
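[Editor's note: as a quick way to see what Metaxa actually extracted, the nie:plainTextContent literal can be pulled out of the enhancement RDF/XML with the JDK's namespace-aware DOM parser alone. This sketch is not part of the original exchange; a Jena query over the model would work just as well.]

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PlainTextCheck {

    static final String NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";

    // Returns the first nie:plainTextContent literal in the RDF/XML, or null.
    static String plainTextContent(String rdfXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(rdfXml.getBytes("UTF-8")));
        NodeList nodes = doc.getElementsByTagNameNS(NIE, "plainTextContent");
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample mirroring the triple from the enhancement output below.
        String rdf = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:j.0=\"" + NIE + "\">"
                + "<rdf:Description>"
                + "<j.0:plainTextContent>Microsoft Word-Dokument\nsrecko</j.0:plainTextContent>"
                + "</rdf:Description></rdf:RDF>";
        System.out.println(plainTextContent(rdf));
    }
}
```

If this prints only the document's title metadata rather than the body text, the extraction itself failed, which is exactly the symptom discussed above.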
On 12.01.2012, at 18:40, srecko joksimovic wrote:
Hi Rupert,
I have another question, and I will finish soon.
I tried to annotate a PDF document, and I didn't get the result I expected.
Then I put the string you sent to me,
"John Smith works for the Apple Inc. in Cupertino, California."
into an MS Word document, and this is the result I got:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
    xmlns:j.1="http://purl.org/dc/terms/"
    xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
    xmlns:j.3="http://fise.iks-project.eu/ontology/">
  <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
    <j.1:language>fr</j.1:language>
  </rdf:Description>
  <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
    <j.0:plainTextContent>Microsoft Word-Dokument
srecko</j.0:plainTextContent>
  </rdf:Description>
  <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
    <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
  </rdf:Description>
</rdf:RDF>
and this is the code:
public List<String> Annotate(byte[] _stream_to_annotate,
        ServiceUtils.MIMETypes _content_type, String _encoding)
{
    List<String> _return_list = new ArrayList<String>();
    try
    {
        URL url = new URL(ServiceUtils.SERVICE_URL);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Accept", "application/rdf+xml");
        con.setRequestProperty("Content-type", _content_type.getValue());
        java.io.OutputStream out = con.getOutputStream();
        IOUtils.write(_stream_to_annotate, out);
        IOUtils.closeQuietly(out);
        con.connect(); // send the request
        if (con.getResponseCode() > 299)
        {
            java.io.InputStream errorStream = con.getErrorStream();
            if (errorStream != null)
            {
                String errorMessage = IOUtils.toString(errorStream);
                IOUtils.closeQuietly(errorStream);
            }
            else
            {
                // no error data:
                // write a default error message with the status code
            }
        }
        else
        {
            Model model = ModelFactory.createDefaultModel();
            java.io.InputStream enhancementResults = con.getInputStream();
            model.read(enhancementResults, null);
            String queryStringForGraph =
                    "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
                    "SELECT ?label WHERE {?alias t:entity-reference ?label}";
            Query query = QueryFactory.create(queryStringForGraph);
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            ResultSet results = qe.execSelect();
            while (results.hasNext())
            {
                // read the bound ?label value instead of the whole solution's toString()
                _return_list.add(results.next().get("label").toString());
            }
        }
    }
    catch (Exception ex)
    {
        System.out.println(ex.getMessage());
    }
    return _return_list;
}
On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic
<[email protected]> wrote:
Hi Rupert,
Thank you for the answer. I've probably missed that.
Best,
Srecko
On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler
<[email protected]> wrote:
Hi Srecko
I think the last time I directly used this API was about 3-4 years ago, but
after a look at the HTTP client tutorial [1] I think the reason for your
problem is that you do not execute the GetMethod.
Based on this tutorial the code should look like:
// Create an instance of HttpClient.
HttpClient client = new HttpClient();
GetMethod get = new GetMethod(url);
try {
    // Execute the method.
    int statusCode = client.executeMethod(get);
    if (statusCode != HttpStatus.SC_OK) {
        // handle the error
    }
    InputStream t_is = get.getResponseBodyAsStream();
    // read the data of the stream
} finally {
    // as in the tutorial: always release the connection
    get.releaseConnection();
}
In addition, you should not use a Reader if you want to read byte-oriented
data from the input stream.
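[Editor's note: Rupert's point about Readers is worth illustrating. The following self-contained sketch (plain JDK, no Stanbol or HttpClient involved) shows why: decoding bytes through a UTF-8 Reader silently replaces byte sequences that are not valid UTF-8, corrupting binary content such as a PDF or Word file, while a plain InputStream copy preserves every byte.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.util.Arrays;

public class BytesVsReader {

    // Byte-oriented copy: every byte is passed through unchanged.
    static byte[] copyViaStream(byte[] data) throws Exception {
        InputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Character-oriented copy: bytes are decoded as UTF-8 and re-encoded,
    // so sequences that are not valid UTF-8 are replaced with U+FFFD.
    static byte[] copyViaReader(byte[] data) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), "UTF-8");
        StringWriter sw = new StringWriter();
        int c;
        while ((c = reader.read()) != -1) {
            sw.write(c);
        }
        return sw.toString().getBytes("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // The first bytes of a JPEG file: not valid UTF-8.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(Arrays.equals(binary, copyViaStream(binary))); // true
        System.out.println(Arrays.equals(binary, copyViaReader(binary))); // false
    }
}
```

The same applies to IOUtils: IOUtils.toByteArray(InputStream) keeps the bytes intact, while the Reader-based overloads go through a charset decode.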
hope this helps
best
Rupert
[1] http://hc.apache.org/httpclient-3.x/tutorial.html
On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
That's it. Thank you!
I have already configured KeywordLinkingEngine when I used my own
ontology.
I think I'm familiar with that and I will try that option too.
In the meanwhile I found another interesting problem. I tried to annotate a
document and a web page. For the web page, I tried
IOUtils.write(byte[], out), so I had to convert the URL content to byte[]:
public static byte[] GetBytesFromURL(String _url) throws IOException
{
    GetMethod get = new GetMethod(_url);
    InputStream t_is = get.getResponseBodyAsStream();
    byte[] buffer = new byte[1024];
    int count = -1;
    Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
    byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
    return t_bytes;
}
But, the problem is that I'm getting null for InputStream.
Any ideas?
Best,
Srecko
-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Wednesday, January 11, 2012 22:08
To: Srecko Joksimovic
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology
On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
Hi Rupert,
When I load localhost:8080/engines it says this:
There are currently 5 active engines.
org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhanc
ementEngine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEng
ine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEng
ine
Maybe this could tell you something?
These are exactly the 5 engines that are expected to run with the default
configuration. Based on this, the Stanbol Enhancer should just work fine.

After looking at the text you enhanced, I noticed however that it does not
mention any named entities such as Persons, Organizations, and Places. So I
checked it with my local Stanbol version and it also did not detect any
entities. So to check whether Stanbol works as expected, you should try to
use another text that mentions some Named Entities, such as

"John Smith works for the Apple Inc. in Cupertino, California."

If you want to search also for entities like "Bank", "Blog", "Consumer",
"Telephone" ... you need to also configure a KeywordLinkingEngine for
dbpedia. Part B of [3] provides more information on how to do that.

But let me mention that the KeywordLinkingEngine is more useful if used in
combination with your own domain-specific thesaurus rather than a global
data set like dbpedia. When used with dbpedia you will also get a lot of
false positives.
best
Rupert
[3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html