Hi,
Thank you! I will check out the latest version.
I'm using application/msword, because I thought that was the right one.
Could you please send me the correct MIME types for the pdf, txt, ppt,
xls and odt formats?
Best,
Srecko
On Fri, Jan 13, 2012 at 1:34 PM, Walter Kasper <[email protected]> wrote:
Hi,
We fixed the problem with unresolved relative URLs in HTML
documents. In the case of your Wikipedia page it came from an
embedded rel-license microformat. If you are interested only in
text extraction, you can also just disable the RDFa and Microformat
extractors in the configuration for the HTML extraction.
We also tested Word documents with your test sentence. Everything
worked fine for us. Did you use the correct mime type? The correct
ones for Word documents are:
doc format (<= Word 2003): application/vnd.ms-word
docx format (Word 2007): application/vnd.openxmlformats-officedocument.wordprocessingml
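A minimal sketch of how a client could keep these content types in one place. The enum name MimeTypes and the helper fromFileName are hypothetical stand-ins for the ServiceUtils.MIMETypes type used in the client code further down this thread. The doc and docx values are copied as written in this message; the other values are the IANA-registered media types and are only assumed, not confirmed, to be the ones Metaxa expects.

```java
import java.util.Locale;

/** Hypothetical stand-in for the ServiceUtils.MIMETypes enum used by the client. */
enum MimeTypes {
    // Values exactly as quoted in this message for the Metaxa engine.
    DOC("application/vnd.ms-word"),
    DOCX("application/vnd.openxmlformats-officedocument.wordprocessingml"),
    // IANA-registered media types; an assumption as far as Metaxa is concerned.
    PDF("application/pdf"),
    TXT("text/plain"),
    PPT("application/vnd.ms-powerpoint"),
    XLS("application/vnd.ms-excel"),
    ODT("application/vnd.oasis.opendocument.text");

    private final String value;

    MimeTypes(String value) {
        this.value = value;
    }

    /** Returns the string to send in the Content-type request header. */
    String getValue() {
        return value;
    }

    /** Picks a type from the file extension, ignoring case. */
    static MimeTypes fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String extension = fileName.substring(dot + 1).toUpperCase(Locale.ROOT);
        // valueOf throws IllegalArgumentException for extensions not listed above.
        return MimeTypes.valueOf(extension);
    }
}
```

With such an enum the caller would pass MimeTypes.fromFileName(name).getValue() as the Content-type header instead of a hand-typed string.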
Best regards,
Walter
srecko joksimovic wrote:
Hi Walter,
The Word document is nothing special, just one line of text:
"John Smith works for the Apple Inc. in Cupertino, California."
Rupert suggested this sentence in order to test text annotation.
Since I know the result of annotating this string, I thought I would
create a Word document with the same content for test purposes.
The error with your HTML page apparently arises from a bug in
resolving
relative URLs in one of the HTML extractors. We will fix that.
Does this mean that I can't annotate HTML pages at the moment, or
does it depend on the page?
Best,
Srecko
On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:
Hi Srecko,
I don't know what the problem with your Word document
could have been.
Could you send it to me for testing?
The error with your HTML page apparently arises from a bug
in resolving
relative URLs in one of the HTML extractors. We will fix that.
Best regards,
Walter
Srecko Joksimovic wrote:
Thank you Rupert!
It is probably something that I missed.
Best,
Srecko
-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Thursday, January 12, 2012 20:16
To: Srecko Joksimovic; [email protected]
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology
Hi Srecko
It seems that both cases are related to the Metaxa Engine. My
knowledge about the libs used by this engine to extract the textual
content is very limited.
So I might not be the right person to look into that.
In the first example, I think Metaxa was not able to extract the text
from the Word document, because the only plainTextContent triple
noted is
<j.0:plainTextContent>Microsoft Word-Dokument
srecko</j.0:plainTextContent>
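A quick way to check what text was actually extracted is to pull the plainTextContent values out of the RDF/XML response. The following is only a sketch using the JDK's DOM parser: the class and method names are made up for illustration, and it assumes the response serializes the property as an element (as in the output quoted in this thread) rather than as an XML attribute. An RDF library would be the more robust choice.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of Stanbol or Metaxa. */
final class PlainTextCheck {
    // Namespace bound to the j.0 prefix in the response quoted in this thread.
    static final String NIE_NS =
            "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";

    /** Returns the trimmed text of every nie:plainTextContent element. */
    static List<String> plainTextContents(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookup below
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(NIE_NS, "plainTextContent");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
    }
}
```

If the returned list contains only document metadata such as the title and author, and not the sentence that was typed into the document, the text extraction itself failed before any entity engine ran.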
The second example looks like an issue within the RDF metadata
generation in Aperture.
I also sent this reply directly to Walter Kasper. He is the one who
contributed this engine and should be able to provide more
information.
best
Rupert
On 12.01.2012, at 18:40, srecko joksimovic wrote:
Hi Rupert,
I have another question, and then I will be finished soon.
I tried to annotate a PDF document, and I didn't get the result I
expected. Then I put the string you sent me
"John Smith works for the Apple Inc. in Cupertino, California."
in an MS Word document, and this is the result I got:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
    xmlns:j.1="http://purl.org/dc/terms/"
    xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
    xmlns:j.3="http://fise.iks-project.eu/ontology/">
  <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
    <j.1:language>fr</j.1:language>
  </rdf:Description>
  <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
    <j.0:plainTextContent>Microsoft Word-Dokument
srecko</j.0:plainTextContent>
  </rdf:Description>
  <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
    <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
  </rdf:Description>
</rdf:RDF>
and this is the code:
public List<String> Annotate(byte[] _stream_to_annotate,
        ServiceUtils.MIMETypes _content_type, String _encoding)
{
    List<String> _return_list = new ArrayList<String>();
    try
    {
        URL url = new URL(ServiceUtils.SERVICE_URL);
        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Accept", "application/rdf+xml");
        con.setRequestProperty("Content-type", _content_type.getValue());
        java.io.OutputStream out = con.getOutputStream();
        IOUtils.write(_stream_to_annotate, out);
        IOUtils.closeQuietly(out);
        con.connect(); //send the request
        if(con.getResponseCode() > 299)
        {
            java.io.InputStream errorStream = con.getErrorStream();
            if(errorStream != null)
            {
                String errorMessage = IOUtils.toString(errorStream);
                IOUtils.closeQuietly(errorStream);
            }
            else
            {
                //no error data
                //write default error message with the status code
            }
        }
        else
        {
            Model model = ModelFactory.createDefaultModel();
            java.io.InputStream enhancementResults = con.getInputStream();
            model.read(enhancementResults, null);
            String queryStringForGraph = "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
                "SELECT ?label WHERE {?alias t:entity-reference ?label}";
            Query query = QueryFactory.create(queryStringForGraph);
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            ResultSet results = qe.execSelect();
            while(results.hasNext())
            {
                _return_list.add(results.next().toString());
            }
        }
    }
    catch(Exception ex)
    {
        System.out.println(ex.getMessage());
    }
    return _return_list;
}
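The error branch in the method above reads the error body but never reports it, and the "default error message" case is left as a comment. A possible way to fill that in is sketched below, using only the JDK. The class name ErrorMessages, the method describe and the message wording are hypothetical, and the error body is assumed to be UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for reporting failed enhancer requests. */
final class ErrorMessages {
    /** Builds a message from the HTTP status and, if present, the error body. */
    static String describe(int statusCode, InputStream errorStream) {
        String fallback = "Enhancer request failed with HTTP status " + statusCode;
        if (errorStream == null) {
            // No error data: use the default message with the status code.
            return fallback;
        }
        try (InputStream in = errorStream) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                body.write(buffer, 0, n);
            }
            // Assumption: the server sends its error body as UTF-8 text.
            String text = new String(body.toByteArray(), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? fallback : fallback + ": " + text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the method above this would be called as ErrorMessages.describe(con.getResponseCode(), con.getErrorStream()), and the result logged or thrown so that a wrong content type does not silently produce an empty list.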
On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic <[email protected]> wrote:
Hi Rupert,
Thank you for the answer. I've probably missed that.
Best,
Srecko
On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler <[email protected]> wrote:
Hi Srecko
I think the last time I used this API directly was about 3-4 years
ago, but after a look at the HTTP client tutorial [1] I think the
reason for your problem is that you do not execute the GetMethod.
Based on this tutorial, the code should look like this:
// Create an instance of HttpClient.
HttpClient client = new HttpClient();
GetMethod get = new GetMethod(url);
try {
    // Execute the method.
    int statusCode = client.executeMethod(get);
    if (statusCode != HttpStatus.SC_OK) {
        // handle the error
    }
    InputStream t_is = get.getResponseBodyAsStream();
    // read the data of the stream
} finally {
    // Release the connection.
    get.releaseConnection();
}
In addition, you should not use a Reader if you want to read
byte-oriented data from the input stream.
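A Reader decodes bytes into characters, so binary content such as PDF or Word files can be altered on the way. A minimal sketch of reading the stream as raw bytes with only the JDK follows; the class name StreamBytes is made up for illustration, and IOUtils.toByteArray(InputStream) from Commons IO should do the same job.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper that copies a stream without any character decoding. */
final class StreamBytes {
    /** Reads the stream to its end and returns the bytes unchanged. */
    static byte[] toByteArray(InputStream in) {
        if (in == null) {
            throw new IllegalArgumentException("Input stream is null; was the request executed?");
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int count;
            // read() returns -1 once the end of the stream is reached.
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The stream returned by get.getResponseBodyAsStream() can be passed to it directly, once the GetMethod has been executed.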
hope this helps
best
Rupert
[1] http://hc.apache.org/httpclient-3.x/tutorial.html
On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
That's it. Thank you!
I already configured a KeywordLinkingEngine when I used my own
ontology, so I think I'm familiar with that, and I will try that
option too.
In the meantime I found another interesting problem. I tried to
annotate a document and a web page. For the web page, I tried
IOUtils.write(byte[], out), so I had to convert the URL content to
byte[]:
public static byte[] GetBytesFromURL(String _url) throws IOException
{
    GetMethod get = new GetMethod(_url);
    InputStream t_is = get.getResponseBodyAsStream();
    byte[] buffer = new byte[1024];
    int count = -1;
    Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
    byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
    return t_bytes;
}
But the problem is that I'm getting null for the InputStream.
Any ideas?
Best,
Srecko
-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Wednesday, January 11, 2012 22:08
To: Srecko Joksimovic
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology
On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
Hi Rupert,
When I load localhost:8080/engines it says this:
There are currently 5 active engines.
org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
Maybe this could tell you something?
These are exactly the 5 engines that are expected to run with the
default configuration.
Based on this, the Stanbol Enhancer should work fine.
After looking at the text you enhanced, however, I noticed that it
does not mention any named entities such as persons, organizations
and places. So I checked it with my local Stanbol version, which
also did not detect any entities.
So to check whether Stanbol works as expected, you should try another
text that mentions some named entities, such as
"John Smith works for the Apple Inc. in Cupertino, California."
If you also want to search for entities like "Bank", "Blog",
"Consumer" or "Telephone", you also need to configure a
KeywordLinkingEngine for dbpedia. Part B of [3] provides more
information on how to do that.
But let me mention that the KeywordLinkingEngine is more useful when
used in combination with your own domain-specific thesaurus rather
than a global data set like dbpedia. When used with dbpedia you will
also get a lot of false positives.
best
Rupert
[3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html