Hi,

We fixed the problem with unresolved relative URLs in HTML documents. In the case of your Wikipedia page it came from an embedded rel-license microformat. If you are only interested in text extraction, you can also just disable the RDFa and Microformat extractors in the configuration for the HTML extraction.

We also tested Word documents with your test sentence. Everything worked fine for us. Did you use the correct MIME type? The correct ones for Word documents are:

doc format (<= Word 2003): application/msword
docx format (Word 2007): application/vnd.openxmlformats-officedocument.wordprocessingml.document
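As a side note, a tiny helper along these lines (the class and method names here are purely illustrative, not part of Stanbol) can pick the Content-Type from the file extension before POSTing; it uses the commonly registered types for the two Word formats:

```java
import java.util.Map;

public class WordMimeTypes {
    // Illustrative helper (not part of Stanbol): extension -> registered MIME type
    private static final Map<String, String> TYPES = Map.of(
            "doc", "application/msword",
            "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

    // Returns the MIME type for a filename, falling back to a generic binary type
    public static String mimeTypeFor(String filename) {
        String ext = filename.substring(filename.lastIndexOf('.') + 1).toLowerCase();
        return TYPES.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("test.docx"));
    }
}
```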

Best regards,

Walter

srecko joksimovic wrote:
Hi Walter,

Word document is nothing special, just one line of text:

"John Smith works for the Apple Inc. in Cupertino, California."

Rupert suggested this sentence in order to test text annotation. Since I now know the result of annotating this string, I thought I would create a Word document with the same content for test purposes.

The error with your HTML page apparently arises from a bug in resolving
relative URLs in one of the HTML extractors. We will fix that.

Does it mean that I can't annotate HTML pages at the moment, or does that depend on the page?

Best,
Srecko

On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper<[email protected]>  wrote:

Hi Srecko,

I don't know what the problem with your Word document could have been.
Could you send it to me for testing?

The error with your HTML page apparently arises from a bug in resolving
relative URLs in one of the HTML extractors. We will fix that.

Best regards,

Walter


Srecko Joksimovic wrote:

Thank you Rupert!

It is probably something that I missed.

Best,
Srecko

-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Thursday, January 12, 2012 20:16
To: Srecko Joksimovic; [email protected]
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology

Hi Srecko

It seems that both cases are related to the Metaxa Engine. My knowledge about the libs used by this engine to extract the textual content is very limited, so I might not be the right person to look into that.

In the first example I think Metaxa was not able to extract the text from the Word document, because the only plainTextContent triple noted is

<j.0:plainTextContent>Microsoft Word-Dokument&#xD;
srecko</j.0:plainTextContent>

The second example looks like an issue within the RDF metadata generation in Aperture.

I sent this reply also directly to Walter Kasper. He is the one who contributed this engine and should be able to provide more information.

best
Rupert

On 12.01.2012, at 18:40, srecko joksimovic wrote:

Hi Rupert,

I have another question, and I will finish soon.

I tried to annotate a PDF document, and I didn't get the result I expected.
Then I put the string you sent me

"John Smith works for the Apple Inc. in Cupertino, California."

into an MS Word document, and this is the result I got:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
    xmlns:j.1="http://purl.org/dc/terms/"
    xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
    xmlns:j.3="http://fise.iks-project.eu/ontology/">
  <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
    <j.1:language>fr</j.1:language>
  </rdf:Description>
  <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
    <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
srecko</j.0:plainTextContent>
  </rdf:Description>
  <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
    <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
    <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
    <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
    <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
    <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
  </rdf:Description>
</rdf:RDF>


and this is the code:

        public List<String> Annotate(byte[] _stream_to_annotate,
                        ServiceUtils.MIMETypes _content_type, String _encoding)
        {
                List<String> _return_list = new ArrayList<String>();
                try
                {
                        URL url = new URL(ServiceUtils.SERVICE_URL);
                        HttpURLConnection con = (HttpURLConnection) url.openConnection();

                        con.setDoOutput(true);
                        con.setRequestMethod("POST");
                        con.setRequestProperty("Accept", "application/rdf+xml");
                        con.setRequestProperty("Content-type", _content_type.getValue());

                        java.io.OutputStream out = con.getOutputStream();
                        IOUtils.write(_stream_to_annotate, out);
                        IOUtils.closeQuietly(out);

                        con.connect(); // send the request

                        if (con.getResponseCode() > 299)
                        {
                                java.io.InputStream errorStream = con.getErrorStream();
                                if (errorStream != null)
                                {
                                        String errorMessage = IOUtils.toString(errorStream);
                                        IOUtils.closeQuietly(errorStream);
                                }
                                else
                                {
                                        // no error data
                                        // write default error message with the status code
                                }
                        }
                        else
                        {
                                Model model = ModelFactory.createDefaultModel();
                                java.io.InputStream enhancementResults = con.getInputStream();
                                model.read(enhancementResults, null);

                                String queryStringForGraph =
                                        "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
                                        "SELECT ?label WHERE {?alias t:entity-reference ?label}";
                                Query query = QueryFactory.create(queryStringForGraph);
                                QueryExecution qe = QueryExecutionFactory.create(query, model);

                                ResultSet results = qe.execSelect();
                                while (results.hasNext())
                                {
                                        _return_list.add(results.next().toString());
                                }
                        }
                }
                catch (Exception ex)
                {
                        System.out.println(ex.getMessage());
                }
                return _return_list;
        }

On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic <[email protected]> wrote:

Hi Rupert,

Thank you for the answer. I've probably missed that.

Best,
Srecko


On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler <[email protected]> wrote:

Hi Srecko

I think the last time I directly used this API is about 3-4 years ago, but after a look at the HTTP client tutorial [1] I think the reason for your problem is that you do not execute the GetMethod.

Based on this tutorial the code should look like:

    // Create an instance of HttpClient.
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod(url);
    try {
        // Execute the method.
        int statusCode = client.executeMethod(get);
        if (statusCode != HttpStatus.SC_OK) {
            // handle the error
        }
        InputStream t_is = get.getResponseBodyAsStream();
        // read the data of the stream
    } finally {
        // Always release the connection when done.
        get.releaseConnection();
    }

In addition, you should not use a Reader if you want to read byte-oriented data from the input stream.
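The point about byte-oriented reading can be illustrated with a small stdlib-only sketch (independent of HttpClient, and only a sketch): copying an InputStream through a byte buffer preserves binary content exactly, such as the leading magic bytes of a .doc file, whereas decoding it through a Reader would interpret those bytes as characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class ByteVsReader {
    // Read an InputStream fully into a byte[] without going through a Reader
    static byte[] toByteArray(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Binary content (e.g. the start of an OLE2 .doc file) that a Reader would mangle
        byte[] original = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 'h', 'i'};
        byte[] copy = toByteArray(new ByteArrayInputStream(original));
        System.out.println(java.util.Arrays.equals(original, copy)); // true
    }
}
```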

hope this helps
best
Rupert

[1] http://hc.apache.org/httpclient-3.x/tutorial.html

On 11.01.2012, at 22:34, Srecko Joksimovic wrote:

That's it. Thank you!

I have already configured the KeywordLinkingEngine when I used my own ontology. I think I'm familiar with that and I will try that option too.

In the meanwhile I found another interesting problem. I tried to annotate a document and a web page. With the web page, I tried IOUtils.write(byte[], out) and I had to convert the URL to byte[]:

public static byte[] GetBytesFromURL(String _url) throws IOException
{
       GetMethod get = new GetMethod(_url);
       InputStream t_is = get.getResponseBodyAsStream();
       byte[] buffer = new byte[1024];
       int count = -1;
       Reader t_url_reader = new BufferedReader(new
InputStreamReader(t_is));
       byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");

       return t_bytes;
}

But the problem is that I'm getting null for the InputStream.

Any ideas?

Best,
Srecko



-----Original Message-----
From: Rupert Westenthaler [mailto:[email protected]]
Sent: Wednesday, January 11, 2012 22:08
To: Srecko Joksimovic
Cc: [email protected]
Subject: Re: Annotating using DBPedia ontology


On 11.01.2012, at 21:41, Srecko Joksimovic wrote:

Hi Rupert,

When I load localhost:8080/engines it says this:

There are currently 5 active engines.
org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
Maybe this could tell you something?

These are exactly the 5 engines that are expected to run with the default configuration. Based on this, the Stanbol Enhancer should just work fine.

After looking at the text you enhanced I noticed, however, that it does not mention any named entities such as persons, organizations and places. So I checked it with my local Stanbol version and also did not get any detected entities.

So to check whether Stanbol works as expected, you should try to use another text that mentions some named entities, such as

    "John Smith works for the Apple Inc. in Cupertino, California."


If you also want to search for entities like "Bank", "Blog", "Consumer", "Telephone" etc., you need to configure a KeywordLinkingEngine for DBpedia as well. Part B of [3] provides more information on how to do that.

But let me mention that the KeywordLinkingEngine is more useful when used in combination with your own domain-specific thesaurus rather than a global data set like DBpedia. When used with DBpedia you will also get a lot of false positives.

best
Rupert

[3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html






