Re: Annotating using DBPedia ontology

srecko joksimovic Fri, 13 Jan 2012 05:30:22 -0800

Hi,

Thank you! I will check out the latest version.
I'm using application/msword because I thought that was the right one.
Could you please send me the correct MIME types for the pdf, txt, ppt, xls
and odt formats?


Best,
Srecko

On Fri, Jan 13, 2012 at 1:34 PM, Walter Kasper <[email protected]> wrote:

> Hi,
>
> We fixed the problem with unresolved relative URLs in HTML documents. In
> the case of your Wikipedia page it came from an embedded rel-license
> microformat. If you are interested only in text extraction you can also
> just disable the RDFa and Microformat extractors in the configuration for
> the HTML extraction.
>
> We also tested Word documents with your test sentence. Everything worked
> fine for us. Did you use the correct MIME type? The correct ones for Word
> documents are:
>
> doc-Format (<= Word-2003): application/vnd.ms-word
> docx-Format (Word-2007): application/vnd.openxmlformats-officedocument.wordprocessingml
>
> Best regards,
>
> Walter
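
A minimal Java sketch of how a client might pick the Content-type value from a file name, using only the two MIME types quoted in the message above. The class and method names are hypothetical, and no types are assumed for pdf, txt, ppt, xls or odt, since the thread does not state them.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; the names are illustrative and not part of Stanbol.
final class WordMimeTypes {
    // Both values are taken verbatim from the message above; nothing else is assumed.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "doc", "application/vnd.ms-word",
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml");

    // Returns the MIME type for a file name, or empty if the thread names none for it.
    static Optional<String> forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }
}
```

The result could then be passed to something like con.setRequestProperty("Content-type", ...), as in the Annotate method quoted further down.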
>
> srecko joksimovic wrote:
>
>> Hi Walter,
>>
>> Word document is nothing special, just one line of text:
>>
>> "John Smith works for the Apple Inc. in Cupertino, California."
>>
>> Rupert suggested this sentence in order to test text annotation. As I know
>> the result of annotating this string, I thought I would create a Word
>> document with the same content for test purposes.
>>
>> > The error with your HTML page apparently arises from a bug in resolving
>> > relative URLs in one of the HTML extractors. We will fix that.
>>
>> Does this mean that I can't annotate HTML pages at the moment, or does
>> that depend on the page?
>>
>> Best,
>> Srecko
>>
>> On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:
>>
>>> Hi Srecko,
>>>
>>> I don't know what the problem with your Word document could have been.
>>> Could you send it to me for testing?
>>>
>>> The error with your HTML page apparently arises from a bug in resolving
>>> relative URLs in one of the HTML extractors. We will fix that.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>>
>>> Srecko Joksimovic wrote:
>>>
>>>  Thank you Rupert!
>>>>
>>>> It is probably something that I missed.
>>>>
>>>> Best,
>>>> Srecko
>>>>
>>>> -----Original Message-----
>>>> From: Rupert Westenthaler [mailto:rupert.westenthaler@gmail.com]
>>>> Sent: Thursday, January 12, 2012 20:16
>>>> To: Srecko Joksimovic; [email protected]
>>>> Cc: stanbol-dev@incubator.apache.org
>>>> Subject: Re: Annotating using DBPedia ontology
>>>>
>>>> Hi Srecko
>>>>
>>>> It seems that both cases are related to the Metaxa Engine. My knowledge
>>>> about the libs used by this engine to extract the textual content is
>>>> very limited, so I might not be the right person to look into that.
>>>>
>>>> In the first example I think Metaxa was not able to extract the text
>>>> from the Word document, because the only plainTextContent triple noted is
>>>>
>>>> <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
>>>> srecko</j.0:plainTextContent>
>>>>
>>>> The second example looks like an issue within the RDF metadata
>>>> generation in Aperture.
>>>>
>>>> I sent this reply also directly to Walter Kasper. He is the one who
>>>> contributed this engine and should be able to provide more information.
>>>>
>>>> best
>>>> Rupert
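
To see what text Metaxa actually extracted, the plainTextContent values can be read out of the enhancer response. The following is only a sketch under stated assumptions: it uses the JDK's namespace-aware DOM parser rather than Jena (which the code in this thread uses), it assumes the response is RDF/XML with the property serialized as an element, and the class and method names are hypothetical. A real RDF parser is more robust, because RDF/XML allows other serializations of the same triple.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of Stanbol, Metaxa or Aperture.
final class PlainTextContent {
    // NIE namespace as declared (prefix j.0) in the RDF/XML quoted in this thread.
    static final String NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";

    // Returns the text of every nie:plainTextContent element in the given RDF/XML.
    static List<String> extract(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(NIE, "plainTextContent");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Wrapped so callers need no checked-exception handling in this sketch.
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
    }
}
```

If the returned text is only document metadata, as in the Word example here, the problem lies in text extraction rather than in the later annotation engines.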
>>>>
>>>> On 12.01.2012, at 18:40, srecko joksimovic wrote:
>>>>
>>>>> Hi Rupert,
>>>>>
>>>>> I have another question, and I will finish soon.
>>>>>
>>>>> I tried to annotate a PDF document, and I didn't get the result I
>>>>> expected. Then I put the string you sent me,
>>>>> "John Smith works for the Apple Inc. in Cupertino, California.",
>>>>> in an MS Word document, and this is the result I got:
>>>>>
>>>>> <rdf:RDF
>>>>>     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>     xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
>>>>>     xmlns:j.1="http://purl.org/dc/terms/"
>>>>>     xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
>>>>>     xmlns:j.3="http://fise.iks-project.eu/ontology/">
>>>>>   <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
>>>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
>>>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>>>     <j.1:language>fr</j.1:language>
>>>>>   </rdf:Description>
>>>>>   <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
>>>>>     <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
>>>>>     <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
>>>>> srecko</j.0:plainTextContent>
>>>>>   </rdf:Description>
>>>>>   <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
>>>>>     <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
>>>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
>>>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>>>   </rdf:Description>
>>>>> </rdf:RDF>
>>>>>
>>>>>
>>>>> and this is the code:
>>>>>
>>>>> public List<String> Annotate(byte[] _stream_to_annotate,
>>>>>         ServiceUtils.MIMETypes _content_type, String _encoding)
>>>>> {
>>>>>     List<String> _return_list = new ArrayList<String>();
>>>>>     try
>>>>>     {
>>>>>         URL url = new URL(ServiceUtils.SERVICE_URL);
>>>>>         HttpURLConnection con = (HttpURLConnection)url.openConnection();
>>>>>         con.setDoOutput(true);
>>>>>         con.setRequestMethod("POST");
>>>>>         con.setRequestProperty("Accept", "application/rdf+xml");
>>>>>         con.setRequestProperty("Content-type", _content_type.getValue());
>>>>>         java.io.OutputStream out = con.getOutputStream();
>>>>>         IOUtils.write(_stream_to_annotate, out);
>>>>>         IOUtils.closeQuietly(out);
>>>>>         con.connect(); //send the request
>>>>>
>>>>>         if(con.getResponseCode() > 299)
>>>>>         {
>>>>>             java.io.InputStream errorStream = con.getErrorStream();
>>>>>             if(errorStream != null)
>>>>>             {
>>>>>                 String errorMessage = IOUtils.toString(errorStream);
>>>>>                 IOUtils.closeQuietly(errorStream);
>>>>>             }
>>>>>             else
>>>>>             {
>>>>>                 //no error data
>>>>>                 //write default error message with the status code
>>>>>             }
>>>>>         }
>>>>>         else
>>>>>         {
>>>>>             Model model = ModelFactory.createDefaultModel();
>>>>>             java.io.InputStream enhancementResults = con.getInputStream();
>>>>>             model.read(enhancementResults, null);
>>>>>
>>>>>             String queryStringForGraph =
>>>>>                 "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
>>>>>                 "SELECT ?label WHERE {?alias t:entity-reference ?label}";
>>>>>             Query query = QueryFactory.create(queryStringForGraph);
>>>>>             QueryExecution qe = QueryExecutionFactory.create(query, model);
>>>>>
>>>>>             ResultSet results = qe.execSelect();
>>>>>             while(results.hasNext())
>>>>>             {
>>>>>                 _return_list.add(results.next().toString());
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>     catch(Exception ex)
>>>>>     {
>>>>>         System.out.println(ex.getMessage());
>>>>>     }
>>>>>     return _return_list;
>>>>> }
>>>>>
>>>>> On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic
>>>>>
>>>>>  <[email protected]>   wrote:
>>>>
>>>>  Hi Rupert,
>>>>>
>>>>> Thank you for the answer. I've probably missed that.
>>>>>
>>>>> Best,
>>>>> Srecko
>>>>>
>>>>>
>>>>> On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler
>>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Srecko
>>>>>
>>>>> I think the last time I directly used this API was about 3-4 years ago,
>>>>> but after a look at the HttpClient tutorial [1] I think the reason for
>>>>> your problem is that you do not execute the GetMethod.
>>>>>
>>>>> Based on this tutorial the code should look like
>>>>>
>>>>>    // Create an instance of HttpClient.
>>>>>    HttpClient client = new HttpClient();
>>>>>    GetMethod get = new GetMethod(url);
>>>>>    try {
>>>>>        // Execute the method.
>>>>>        int statusCode = client.executeMethod(get);
>>>>>        if (statusCode != HttpStatus.SC_OK) {
>>>>>            //handle the error
>>>>>        }
>>>>>        InputStream t_is = get.getResponseBodyAsStream();
>>>>>        //read the data of the stream
>>>>>    } finally {
>>>>>        // Release the connection once the stream has been read.
>>>>>        get.releaseConnection();
>>>>>    }
>>>>>
>>>>> In addition you should not use a Reader if you want to read
>>>>> byte-oriented data from the input stream.
>>>>>
>>>>> hope this helps
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://hc.apache.org/httpclient-3.x/tutorial.html
>>>>>
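
The advice above about not using a Reader for byte-oriented data can be sketched with plain JDK streams. This is an illustrative example rather than code from the thread: the class and method names are hypothetical, and it copies the raw bytes without any character decoding, so binary content such as a Word or PDF file is not altered.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper; names are illustrative only.
final class StreamBytes {
    // Copies the whole stream into a byte array without decoding it as text.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int count;
            // read() returns -1 at end of stream
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Wrapped so the sketch can be called without checked-exception handling.
            throw new UncheckedIOException(e);
        }
    }
}
```

With Commons IO, which the thread already uses, IOUtils.toByteArray(InputStream) serves the same purpose when given the response body stream directly.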
>>>>>
>>>>> On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
>>>>>
>>>>>> That's it. Thank you!
>>>>>>
>>>>>> I have already configured the KeywordLinkingEngine when I used my own
>>>>>> ontology. I think I'm familiar with that and I will try that option too.
>>>>>>
>>>>>> In the meantime I found another interesting problem. I tried to annotate
>>>>>> a document and a web page. With the web page, I tried
>>>>>> IOUtils.write(byte[], out) and I had to convert the URL to byte[]:
>>>>>>
>>>>>> public static byte[] GetBytesFromURL(String _url) throws IOException
>>>>>> {
>>>>>>       GetMethod get = new GetMethod(_url);
>>>>>>       InputStream t_is = get.getResponseBodyAsStream();
>>>>>>       byte[] buffer = new byte[1024];
>>>>>>       int count = -1;
>>>>>>       Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
>>>>>>       byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
>>>>>>
>>>>>>       return t_bytes;
>>>>>> }
>>>>>>
>>>>>> But, the problem is that I'm getting null for InputStream.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Best,
>>>>>> Srecko
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Rupert Westenthaler [mailto:rupert.westenthaler@gmail.com]
>>>>>> Sent: Wednesday, January 11, 2012 22:08
>>>>>> To: Srecko Joksimovic
>>>>>> Cc: stanbol-dev@incubator.apache.org
>>>>>> Subject: Re: Annotating using DBPedia ontology
>>>>>>
>>>>>>
>>>>>> On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
>>>>>>
>>>>>>> Hi Rupert,
>>>>>>>
>>>>>>> When I load localhost:8080/engines it says this:
>>>>>>>
>>>>>>> There are currently 5 active engines.
>>>>>>> org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
>>>>>>> org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
>>>>>>> org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
>>>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>>>
>>>>>>> Maybe this could tell you something?
>>>>>>>
>>>>>> These are exactly the 5 engines that are expected to run with the
>>>>>> default configuration. Based on this the Stanbol Enhancer should just
>>>>>> work fine.
>>>>>>
>>>>>> After looking at the text you enhanced I noticed however that it does
>>>>>> not mention any named entities such as Persons, Organizations and
>>>>>> Places. So I checked it with my local Stanbol version and also did not
>>>>>> get any detected entities.
>>>>>>
>>>>>> So to check if Stanbol works as expected you should try to use another
>>>>>> text that mentions some Named Entities such as
>>>>>>
>>>>>>    "John Smith works for the Apple Inc. in Cupertino, California."
>>>>>>
>>>>>> If you also want to search for entities like "Bank", "Blog",
>>>>>> "Consumer", "Telephone", you also need to configure a
>>>>>> KeywordLinkingEngine for dbpedia. Part B of [3] provides more
>>>>>> information on how to do that.
>>>>>>
>>>>>> But let me mention that the KeywordLinkingEngine is more useful if used
>>>>>> in combination with your own domain-specific thesaurus rather than a
>>>>>> global data set like dbpedia. When used with dbpedia you will also get
>>>>>> a lot of false positives.
>>>>>>
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>> [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
>
>
