Re: Annotating using DBPedia ontology

Walter Kasper Fri, 13 Jan 2012 05:51:53 -0800

Hi,

Here are the recognized standard MIME types:


pdf: application/pdf
txt: text/plain
ppt: application/vnd.ms-powerpoint
xls: application/vnd.ms-excel
odt: application/vnd.oasis.opendocument.text
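
A client that picks the Content-Type from a file name could keep this list in a small lookup table. The following is only a minimal sketch: the class name MetaxaMimeTypes and the method forFileName are hypothetical and not part of Stanbol or Metaxa, and the table contains nothing beyond the five types listed above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Stanbol): file extension to MIME type lookup. */
final class MetaxaMimeTypes {
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        // Values copied from the list in this message.
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("odt", "application/vnd.oasis.opendocument.text");
    }

    private MetaxaMimeTypes() {}

    /** Returns the MIME type for a file name, or null if the extension is not in the table. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(extension);
    }
}
```

Returning null for unknown extensions leaves it to the caller to decide whether to skip the file or fall back to another type.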

Regards,

Walter

srecko joksimovic wrote:

Hi,

Thank you! I will check out the latest version.

I'm using application/msword, because I thought that was the right one. Could you please send me the correct MIME types for the pdf, txt, ppt, xls and odt formats?


Best,
Srecko

On Fri, Jan 13, 2012 at 1:34 PM, Walter Kasper <[email protected]> wrote:


    Hi,

    We fixed the problem with unresolved relative URLs in HTML
    documents. In the case of your Wikipedia page it came from an
    embedded rel-license microformat. If you are interested only in
    text extraction, you can also just disable the RDFa and Microformat
    extractors in the configuration for the HTML extraction.
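
For readers unfamiliar with the issue: a link inside an HTML page is often relative and only becomes usable once it is resolved against the URL of the page itself. The sketch below illustrates that step with the JDK's java.net.URI. It is not the Metaxa or Aperture code, and the class name RelativeUrlSketch and the example.org addresses are made up for illustration.

```java
import java.net.URI;

/** Illustrative only: resolving a link found in a page against the page's own URL. */
final class RelativeUrlSketch {
    private RelativeUrlSketch() {}

    /**
     * Resolves a (possibly relative) link against the URL of the page it was found in.
     * An absolute link is returned unchanged.
     */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }
}
```

For example, resolve("http://example.org/wiki/Page", "/licenses/by-sa") yields "http://example.org/licenses/by-sa".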

    We also tested Word documents with your test sentence. Everything
    worked fine for us. Did you use the correct MIME type? The correct
    ones for Word documents are:

    doc-Format (<= Word-2003): application/vnd.ms-word
    docx-Format (Word-2007): application/vnd.openxmlformats-officedocument.wordprocessingml

    Best regards,

    Walter

    srecko joksimovic wrote:

        Hi Walter,

        The Word document is nothing special, just one line of text:

        "John Smith works for the Apple Inc. in Cupertino, California."

        Rupert suggested this sentence in order to test text
        annotation. As I know the result of annotating this string,
        I thought I would create a Word document with the same content
        for test purposes.

        The error with your HTML page apparently arises from a bug in
        resolving
        relative URLs in one of the HTML extractors. We will fix that.

        Does it mean that I can't annotate HTML pages at the moment,
        or does that depend on the page?

        Best,
        Srecko

        On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:

            Hi Srecko,

            I don't know what the problem with your Word document
            could have been.
            Could you send it to me for testing?

            The error with your HTML page apparently arises from a bug
            in resolving
            relative URLs in one of the HTML extractors. We will fix that.

            Best regards,

            Walter


            Srecko Joksimovic wrote:

                Thank you Rupert!

                It is probably something that I missed.

                Best,
                Srecko

                -----Original Message-----
                From: Rupert Westenthaler [mailto:[email protected]]
                Sent: Thursday, January 12, 2012 20:16
                To: Srecko Joksimovic; [email protected]
                Cc: [email protected]
                Subject: Re: Annotating using DBPedia ontology

                Hi Srecko

                It seems that both cases are related to the Metaxa
                Engine. My knowledge about the libs used by this engine
                to extract the textual content is very limited, so I
                might not be the right person to look into that.

                In the first example I think Metaxa was not able to
                extract the text from the Word document, because the
                only plainTextContent triple in the result is

                <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
                srecko</j.0:plainTextContent>
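
A quick way to see what text Metaxa actually extracted is to pull the plainTextContent values out of the RDF/XML response. The sketch below does this with the JDK's DOM parser so that it needs no extra libraries. It assumes the property is serialized as an element in the NIE namespace, as in the result quoted in this thread; a real RDF parser is more robust against other serializations. The class name PlainTextContentSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Illustrative only: lists the nie:plainTextContent values found in an RDF/XML string. */
final class PlainTextContentSketch {
    /** Namespace bound to the j.0 prefix in the result quoted in this thread. */
    static final String NIE_NS = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";

    private PlainTextContentSketch() {}

    static List<String> plainTextContents(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace-based lookup below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(NIE_NS, "plainTextContent");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers do not need checked-exception handling.
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
    }
}
```

If the returned text is only a placeholder instead of the document's sentence, the extraction step failed before any entity engine had a chance to run.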

                The second example looks like an issue within the RDF
                metadata generation in Aperture.

                I sent this reply also directly to Walter Kasper. He
                is the one who contributed this engine and should be
                able to provide more information.

                best
                Rupert

                On 12.01.2012, at 18:40, srecko joksimovic wrote:

                 Hi Rupert,

                    I have another question, and I will finish soon.

                    I tried to annotate a PDF document, and I didn't get
                    the result I expected. Then I put the string you sent
                    me

                    "John Smith works for the Apple Inc. in Cupertino,
                    California."

                    into an MS Word document, and this is the result I got:

                    <rdf:RDF
                        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                        xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
                        xmlns:j.1="http://purl.org/dc/terms/"
                        xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
                        xmlns:j.3="http://fise.iks-project.eu/ontology/">
                      <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
                        <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
                        <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
                        <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
                        <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
                        <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
                        <j.1:language>fr</j.1:language>
                      </rdf:Description>
                      <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
                        <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
                        <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
                    srecko</j.0:plainTextContent>
                      </rdf:Description>
                      <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
                        <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
                        <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
                        <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
                        <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
                        <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
                        <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
                      </rdf:Description>
                    </rdf:RDF>



                    and this is the code:

                    public List<String> Annotate(byte[] _stream_to_annotate,
                            ServiceUtils.MIMETypes _content_type, String _encoding)
                    {
                        List<String> _return_list = new ArrayList<String>();
                        try
                        {
                            URL url = new URL(ServiceUtils.SERVICE_URL);
                            HttpURLConnection con = (HttpURLConnection)url.openConnection();
                            con.setDoOutput(true);
                            con.setRequestMethod("POST");
                            con.setRequestProperty("Accept", "application/rdf+xml");
                            con.setRequestProperty("Content-type", _content_type.getValue());
                            java.io.OutputStream out = con.getOutputStream();
                            IOUtils.write(_stream_to_annotate, out);
                            IOUtils.closeQuietly(out);
                            con.connect(); //send the request
                            if(con.getResponseCode() > 299)
                            {
                                java.io.InputStream errorStream = con.getErrorStream();
                                if(errorStream != null)
                                {
                                    String errorMessage = IOUtils.toString(errorStream);
                                    IOUtils.closeQuietly(errorStream);
                                }
                                else
                                {
                                    //no error data
                                    //write default error message with the status code
                                }
                            }
                            else
                            {
                                Model model = ModelFactory.createDefaultModel();
                                java.io.InputStream enhancementResults = con.getInputStream();
                                model.read(enhancementResults, null);
                                String queryStringForGraph =
                                    "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
                                    "SELECT ?label WHERE {?alias t:entity-reference ?label}";
                                Query query = QueryFactory.create(queryStringForGraph);
                                QueryExecution qe = QueryExecutionFactory.create(query, model);
                                ResultSet results = qe.execSelect();
                                while(results.hasNext())
                                {
                                    _return_list.add(results.next().toString());
                                }
                            }
                        }
                        catch(Exception ex)
                        {
                            System.out.println(ex.getMessage());
                        }
                        return _return_list;
                    }

                    On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic <[email protected]> wrote:

                    Hi Rupert,

                    Thank you for the answer. I've probably missed that.

                    Best,
                    Srecko


                    On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler <[email protected]> wrote:

                    Hi Srecko

                    I think the last time I directly used this API was
                    about 3-4 years ago, but after a look at the HTTP
                    client tutorial [1] I think the reason for your
                    problem is that you do not execute the GetMethod.

                    Based on this tutorial the code should look like

                       // Create an instance of HttpClient.
                       HttpClient client = new HttpClient();
                       GetMethod get = new GetMethod(url);
                       try {
                           // Execute the method.
                           int statusCode = client.executeMethod(get);
                           if (statusCode != HttpStatus.SC_OK) {
                               //handle the error
                           }
                           InputStream t_is = get.getResponseBodyAsStream();
                           //read the data of the stream
                       } finally {
                           // Release the connection once the stream has been read.
                           get.releaseConnection();
                       }

                    In addition you should not use a Reader if you
                    want to read byte-oriented data from the input stream.
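
To illustrate the point about avoiding a Reader: the bytes can be copied straight from the InputStream into a buffer, with no character decoding in between. This is a minimal sketch using only the JDK; the class name StreamBytesSketch is hypothetical. Commons IO, which is already used in this thread, offers IOUtils.toByteArray(InputStream) for the same purpose.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative only: reads a stream into a byte array without going through a Reader. */
final class StreamBytesSketch {
    private StreamBytesSketch() {}

    static byte[] readAllBytes(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int count;
            // read() returns -1 at end of stream; copy only the bytes actually read.
            while ((count = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, count);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short; real code may prefer "throws IOException".
            throw new UncheckedIOException(e);
        }
    }
}
```

Because no charset is involved, binary content such as PDF or Word files arrives unchanged, which a Reader cannot guarantee.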

                    hope this helps
                    best
                    Rupert

                    [1] http://hc.apache.org/httpclient-3.x/tutorial.html



                    On 11.01.2012, at 22:34, Srecko Joksimovic wrote:

                     That's it. Thank you!

                        I have already configured the KeywordLinkingEngine
                        when I used my own ontology. I think I'm familiar
                        with that and I will try that option too.

                        In the meantime I found another interesting
                        problem. I tried to annotate a document and a web
                        page. With the web page, I tried
                        IOUtils.write(byte[], out) and I had to
                        convert the URL to byte[]:

                        public static byte[] GetBytesFromURL(String _url) throws IOException
                        {
                              GetMethod get = new GetMethod(_url);
                              InputStream t_is = get.getResponseBodyAsStream();
                              byte[] buffer = new byte[1024];
                              int count = -1;
                              Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
                              byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");

                              return t_bytes;
                        }

                        But the problem is that I'm getting null for
                        the InputStream.

                        Any ideas?

                        Best,
                        Srecko



                        -----Original Message-----
                        From: Rupert Westenthaler [mailto:[email protected]]
                        Sent: Wednesday, January 11, 2012 22:08
                        To: Srecko Joksimovic
                        Cc: [email protected]
                        Subject: Re: Annotating using DBPedia ontology


                        On 11.01.2012, at 21:41, Srecko Joksimovic wrote:

                            Hi Rupert,

                            When I load localhost:8080/engines it says
                            this:

                            There are currently 5 active engines.
                            org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
                            org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
                            org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
                            org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
                            org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine

                            Maybe this could tell you something?

                        These are exactly the 5 engines that are
                        expected to run with the default configuration.
                        Based on this, the Stanbol Enhancer should
                        work just fine.

                        After looking at the text you enhanced, I
                        noticed however that it does not mention any
                        named entities such as Persons, Organizations
                        and Places. So I checked it with my local
                        Stanbol version and also did not get any
                        detected entities.

                        So to check if Stanbol works as expected you
                        should try to use another text that mentions
                        some Named Entities such as

                           "John Smith works for the Apple Inc. in
                        Cupertino, California."


                        If you also want to search for entities like
                        "Bank", "Blog", "Consumer", "Telephone", you
                        also need to configure a KeywordLinkingEngine
                        for dbpedia. Part B of [3] provides more
                        information on how to do that.

                        But let me mention that the
                        KeywordLinkingEngine is more useful if used in
                        combination with your own domain-specific
                        thesaurus rather than a global data set like
                        dbpedia. When used with dbpedia you will also
                        get a lot of false positives.

                        best
                        Rupert

                        [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
