Hi Walter,

The Word document is nothing special, just one line of text:
"John Smith works for the Apple Inc. in Cupertino, California."

Rupert suggested this sentence for testing text annotation. Since I already know the result of annotating this string, I thought I would create a Word document with the same content for test purposes.

You wrote that the error with my HTML page apparently arises from a bug in resolving relative URLs in one of the HTML extractors, and that you will fix it. Does that mean I can't annotate any HTML page at the moment, or does it depend on the particular page?

Best,
Srecko

On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:
> Hi Srecko,
>
> I don't know what the problem with your Word document could have been.
> Could you send it to me for testing?
>
> The error with your HTML page apparently arises from a bug in resolving
> relative URLs in one of the HTML extractors. We will fix that.
>
> Best regards,
>
> Walter
>
>
> Srecko Joksimovic wrote:
>
>> Thank you Rupert!
>>
>> It is probably something that I missed.
>>
>> Best,
>> Srecko
>>
>> -----Original Message-----
>> From: Rupert Westenthaler [mailto:[email protected]]
>> Sent: Thursday, January 12, 2012 20:16
>> To: Srecko Joksimovic; [email protected]
>> Cc: [email protected]
>> Subject: Re: Annotating using DBPedia ontology
>>
>> Hi Srecko
>>
>> It seems that both cases are related to the Metaxa engine. My knowledge
>> about the libs this engine uses to extract the textual content is very
>> limited, so I might not be the right person to look into that.
>>
>> In the first example I think Metaxa was not able to extract the text from
>> the Word document, because the only plainTextContent triple noted is
>>
>> <j.0:plainTextContent>Microsoft Word-Dokument
>> srecko</j.0:plainTextContent>
>>
>> The second example looks like an issue within the RDF metadata generation
>> in Aperture.
>>
>> I also sent this reply directly to Walter Kasper. He is the one who
>> contributed this engine and should be able to provide more information.
>>
>> best
>> Rupert
>>
>> On 12.01.2012, at 18:40, srecko joksimovic wrote:
>>
>>> Hi Rupert,
>>>
>>> I have another question, and I will finish soon.
>>>
>>> I tried to annotate a PDF document, and I didn't get the result I
>>> expected. Then I put the string you sent me,
>>> "John Smith works for the Apple Inc. in Cupertino, California.",
>>> into an MS Word document, and this is the result I got:
>>>
>>> <rdf:RDF
>>>   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>   xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
>>>   xmlns:j.1="http://purl.org/dc/terms/"
>>>   xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
>>>   xmlns:j.3="http://fise.iks-project.eu/ontology/">
>>>   <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
>>>     <j.3:extracted-from
>>>       rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>     <j.1:language>fr</j.1:language>
>>>   </rdf:Description>
>>>   <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
>>>     <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
>>>     <j.0:plainTextContent>Microsoft Word-Dokument
>>> srecko</j.0:plainTextContent>
>>>   </rdf:Description>
>>>   <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
>>>     <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>   </rdf:Description>
>>> </rdf:RDF>
>>>
>>> and this is the code:
>>>
>>> public List<String> Annotate(byte[] _stream_to_annotate,
>>>         ServiceUtils.MIMETypes _content_type, String _encoding)
>>> {
>>>     List<String> _return_list = new ArrayList<String>();
>>>     try
>>>     {
>>>         URL url = new URL(ServiceUtils.SERVICE_URL);
>>>         HttpURLConnection con = (HttpURLConnection) url.openConnection();
>>>         con.setDoOutput(true);
>>>         con.setRequestMethod("POST");
>>>         con.setRequestProperty("Accept", "application/rdf+xml");
>>>         con.setRequestProperty("Content-type", _content_type.getValue());
>>>
>>>         java.io.OutputStream out = con.getOutputStream();
>>>         IOUtils.write(_stream_to_annotate, out);
>>>         IOUtils.closeQuietly(out);
>>>
>>>         con.connect(); // send the request
>>>
>>>         if (con.getResponseCode() > 299)
>>>         {
>>>             java.io.InputStream errorStream = con.getErrorStream();
>>>             if (errorStream != null)
>>>             {
>>>                 String errorMessage = IOUtils.toString(errorStream);
>>>                 IOUtils.closeQuietly(errorStream);
>>>             }
>>>             else
>>>             {
>>>                 // no error data:
>>>                 // write a default error message with the status code
>>>             }
>>>         }
>>>         else
>>>         {
>>>             Model model = ModelFactory.createDefaultModel();
>>>             java.io.InputStream enhancementResults = con.getInputStream();
>>>             model.read(enhancementResults, null);
>>>             String queryStringForGraph =
>>>                 "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
>>>                 "SELECT ?label WHERE {?alias t:entity-reference ?label}";
>>>             Query query = QueryFactory.create(queryStringForGraph);
>>>             QueryExecution qe = QueryExecutionFactory.create(query, model);
>>>
>>>             ResultSet results = qe.execSelect();
>>>             while (results.hasNext())
>>>             {
>>>                 _return_list.add(results.next().toString());
>>>             }
>>>         }
>>>     }
>>>     catch (Exception ex)
>>>     {
>>>         System.out.println(ex.getMessage());
>>>     }
>>>     return _return_list;
>>> }
>>>
>>> On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic
>>> <[email protected]> wrote:
>>>
>>> Hi Rupert,
>>>
>>> Thank you for the answer. I've probably missed that.
>>>
>>> Best,
>>> Srecko
>>>
>>>
>>> On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>>
>>> Hi Srecko
>>>
>>> I think the last time I directly used this API was about 3-4 years ago,
>>> but after a look at the HttpClient tutorial [1] I think the reason for
>>> your problem is that you do not execute the GetMethod.
>>>
>>> Based on this tutorial the code should look like:
>>>
>>> // Create an instance of HttpClient.
>>> HttpClient client = new HttpClient();
>>> GetMethod get = new GetMethod(url);
>>> try {
>>>     // Execute the method.
>>>     int statusCode = client.executeMethod(get);
>>>     if (statusCode != HttpStatus.SC_OK) {
>>>         // handle the error
>>>     }
>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>     // read the data of the stream
>>> }
>>>
>>> In addition, you should not use a Reader if you want to read
>>> byte-oriented data from the input stream.
>>>
>>> hope this helps
>>> best
>>> Rupert
>>>
>>> [1] http://hc.apache.org/httpclient-3.x/tutorial.html
>>>
>>> On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
>>>
>>>> That's it. Thank you!
>>>>
>>>> I have already configured the KeywordLinkingEngine when I used my own
>>>> ontology. I think I'm familiar with that and I will try that option too.
>>>>
>>>> In the meantime I found another interesting problem. I tried to
>>>> annotate a document and a web page. With the web page, I tried
>>>> IOUtils.write(byte[], out) and I had to convert the URL to byte[]:
>>>>
>>>> public static byte[] GetBytesFromURL(String _url) throws IOException
>>>> {
>>>>     GetMethod get = new GetMethod(_url);
>>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>>     byte[] buffer = new byte[1024];
>>>>     int count = -1;
>>>>     Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
>>>>     byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
>>>>
>>>>     return t_bytes;
>>>> }
>>>>
>>>> But the problem is that I'm getting null for the InputStream.
>>>>
>>>> Any ideas?
>>>>
>>>> Best,
>>>> Srecko
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Rupert Westenthaler [mailto:[email protected]]
>>>> Sent: Wednesday, January 11, 2012 22:08
>>>> To: Srecko Joksimovic
>>>> Cc: [email protected]
>>>> Subject: Re: Annotating using DBPedia ontology
>>>>
>>>>
>>>> On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
>>>>
>>>>> Hi Rupert,
>>>>>
>>>>> When I load localhost:8080/engines it says this:
>>>>>
>>>>> There are currently 5 active engines.
>>>>> org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
>>>>> org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
>>>>> org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>
>>>>> Maybe this could tell you something?
>>>>
>>>> These are exactly the 5 engines that are expected to run with the
>>>> default configuration. Based on this, the Stanbol Enhancer should work
>>>> just fine.
>>>>
>>>> After looking at the text you enhanced, I noticed however that it does
>>>> not mention any named entities such as persons, organizations and
>>>> places. I checked it with my local Stanbol version and also did not
>>>> get any detected entities.
>>>>
>>>> So to check whether Stanbol works as expected, you should try another
>>>> text that mentions some named entities, such as
>>>>
>>>> "John Smith works for the Apple Inc. in Cupertino, California."
>>>>
>>>> If you also want to search for entities like "Bank", "Blog",
>>>> "Consumer", "Telephone" ... you need to also configure a
>>>> KeywordLinkingEngine for dbpedia.
>>>> Part B of [3] provides more information on how to do that.
>>>>
>>>> But let me mention that the KeywordLinkingEngine is more useful when
>>>> used in combination with your own domain-specific thesaurus rather
>>>> than a global data set like dbpedia. When used with dbpedia you will
>>>> also get a lot of false positives.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
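[Editor's note] Rupert's two points in the thread above (the GetMethod is never executed, and a Reader must not be used for byte-oriented data) can be sketched in a small self-contained helper. This is an illustrative sketch, not code from the thread: it uses java.net.HttpURLConnection from the standard library instead of the commons-httpclient GetMethod, and the class and method names (UrlBytes, toByteArray, getBytesFromUrl) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlBytes {

    // Read an entire stream into a byte[] WITHOUT wrapping it in a Reader.
    // A Reader decodes characters and can corrupt binary content such as
    // Word or PDF documents.
    public static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    // Fetch a URL as raw bytes. Unlike the GetBytesFromURL in the thread,
    // the request is actually executed (getResponseCode() forces it)
    // before the response stream is read, so the stream is never null.
    public static byte[] getBytesFromUrl(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status: " + con.getResponseCode());
        }
        InputStream in = con.getInputStream();
        try {
            return toByteArray(in);
        } finally {
            in.close();
        }
    }
}
```

The resulting byte[] can then be passed unchanged to the Annotate method shown earlier in the thread, since the enhancer endpoint receives the document body as raw bytes.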
