Hi Srecko,

It seems that both cases are related to the Metaxa engine. My knowledge about the 
libraries this engine uses to extract the textual content is very limited, so I 
might not be the right person to look into that. 

In the first example I think Metaxa was not able to extract the text from the 
Word document, because the only plainTextContent triple noted is

<j.0:plainTextContent>Microsoft Word-Dokument&#xD;
srecko</j.0:plainTextContent>

The second example looks like an issue within the RDF metadata generation in 
Aperture.
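
As a quick way to check whether Metaxa extracted any text at all, you can look for the nie:plainTextContent literal in the returned RDF/XML. The sketch below is just an illustration using plain JDK XML parsing (Jena, as in your code, would work equally well); the class and method names are invented for the example:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PlainTextCheck {
    static final String NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";

    // Return the first nie:plainTextContent literal found in an RDF/XML
    // enhancement response, or null if no such triple is present
    // (i.e. text extraction produced nothing).
    public static String plainTextContent(byte[] rdfXml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().parse(new ByteArrayInputStream(rdfXml));
        NodeList nodes = doc.getElementsByTagNameNS(NIE, "plainTextContent");
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }
}
```

If this returns only something like "Microsoft Word-Dokument", as in your example, the extraction itself failed rather than the downstream engines.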
 
I sent this reply also directly to Walter Kasper. He is the one who 
contributed this engine and should be able to provide more information.

best
Rupert

On 12.01.2012, at 18:40, srecko joksimovic wrote:

> Hi Rupert,
> 
> I have another question, and I will finish soon.
> 
> I tried to annotate a PDF document, and I didn't get the result I expected. Then I 
> put the string you sent to me, 
> "John Smith works for the Apple Inc. in Cupertino, California.",
> in an MS Word document, and this is the result I got:
> 
> <rdf:RDF
>     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>     xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
>     xmlns:j.1="http://purl.org/dc/terms/"
>     xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
>     xmlns:j.3="http://fise.iks-project.eu/ontology/">
>   <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>     <j.1:language>fr</j.1:language>
>   </rdf:Description>
>   <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
>     <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
>     <j.0:plainTextContent>Microsoft Word-Dokument&#xD;
> srecko</j.0:plainTextContent>
>   </rdf:Description>
>   <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
>     <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>   </rdf:Description>
> </rdf:RDF>
> 
> 
> and this is the code:
> 
>     public List<String> Annotate(byte[] _stream_to_annotate,
>             ServiceUtils.MIMETypes _content_type, String _encoding)
>     {
>         List<String> _return_list = new ArrayList<String>();
>         try
>         {
>             URL url = new URL(ServiceUtils.SERVICE_URL);
>             HttpURLConnection con = (HttpURLConnection) url.openConnection();
>             con.setDoOutput(true);
>             con.setRequestMethod("POST");
>             con.setRequestProperty("Accept", "application/rdf+xml");
>             con.setRequestProperty("Content-type", _content_type.getValue());
>
>             java.io.OutputStream out = con.getOutputStream();
>             IOUtils.write(_stream_to_annotate, out);
>             IOUtils.closeQuietly(out);
>
>             con.connect(); // send the request
>
>             if (con.getResponseCode() > 299)
>             {
>                 java.io.InputStream errorStream = con.getErrorStream();
>                 if (errorStream != null)
>                 {
>                     String errorMessage = IOUtils.toString(errorStream);
>                     IOUtils.closeQuietly(errorStream);
>                 }
>                 else
>                 {
>                     // no error data
>                     // write default error message with the status code
>                 }
>             }
>             else
>             {
>                 Model model = ModelFactory.createDefaultModel();
>                 java.io.InputStream enhancementResults = con.getInputStream();
>                 model.read(enhancementResults, null);
>
>                 String queryStringForGraph = "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
>                         "SELECT ?label WHERE {?alias t:entity-reference ?label}";
>                 Query query = QueryFactory.create(queryStringForGraph);
>                 QueryExecution qe = QueryExecutionFactory.create(query, model);
>
>                 ResultSet results = qe.execSelect();
>                 while (results.hasNext())
>                 {
>                     _return_list.add(results.next().toString());
>                 }
>             }
>         }
>         catch (Exception ex)
>         {
>             System.out.println(ex.getMessage());
>         }
>         return _return_list;
>     }
> 
> On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic 
> <[email protected]> wrote:
> 
> Hi Rupert,
> 
> Thank you for the answer. I've probably missed that. 
> 
> Best,
> Srecko
> 
> 
> On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler 
> <[email protected]> wrote:
> Hi Srecko
> 
> I think the last time I directly used this API was about 3-4 years ago, but 
> after a look at the HTTP client tutorial [1] I think the reason for your 
> problem is that you do not execute the GetMethod.
> 
> Based on this tutorial the code should look like
> 
>    // Create an instance of HttpClient.
>    HttpClient client = new HttpClient();
>    GetMethod get = new GetMethod(url);
>    try {
>        // Execute the method.
>        int statusCode = client.executeMethod(get);
>        if (statusCode != HttpStatus.SC_OK) {
>            //handle the error
>        }
>        InputStream t_is = get.getResponseBodyAsStream();
>        //read the data of the stream
>    }
> 
> In addition you should not use a Reader if you want to read byte oriented 
> data from the input stream.
> 
> hope this helps
> best
> Rupert
> 
> [1] http://hc.apache.org/httpclient-3.x/tutorial.html
> 
> On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
> 
> > That's it. Thank you!
> > I have already configured KeywordLinkingEngine when I used my own ontology.
> > I think I'm familiar with that and I will try that option too.
> >
> > In meanwhile I found another interesting problem. I tried to annotate
> > document and web page. With web page, I tried
> > IOUtils.write(byte[], out) and I had to convert URL to byte[]:
> >
> > public static byte[] GetBytesFromURL(String _url) throws IOException
> > {
> >       GetMethod get = new GetMethod(_url);
> >       InputStream t_is = get.getResponseBodyAsStream();
> >       byte[] buffer = new byte[1024];
> >       int count = -1;
> >       Reader t_url_reader = new BufferedReader(new
> > InputStreamReader(t_is));
> >       byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
> >
> >       return t_bytes;
> > }
> >
> > But, the problem is that I'm getting null for InputStream.
> >
> > Any ideas?
> >
> > Best,
> > Srecko
> >
> >
> >
> > -----Original Message-----
> > From: Rupert Westenthaler [mailto:[email protected]]
> > Sent: Wednesday, January 11, 2012 22:08
> > To: Srecko Joksimovic
> > Cc: [email protected]
> > Subject: Re: Annotating using DBPedia ontology
> >
> >
> > On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
> >> Hi Rupert,
> >>
> >> When I load localhost:8080/engines it says this:
> >>
> >> There are currently 5 active engines.
> >> org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
> >> org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
> >> org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
> >> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
> >> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
> >>
> >> Maybe this could tell you something?
> >>
> >
> > These are exactly the 5 engines that are expected to run with the default
> > configuration.
> > Based on this the Stanbol Enhancer should just work fine.
> >
> > After looking at the text you enhanced I noticed however that it does not
> > mention any named entities such as Persons, Organizations and Places. So I
> > checked it with my local Stanbol version and it also did not detect any
> > entities.
> >
> > So to check if Stanbol works as expected you should try to use another text
> > that mentions some Named Entities, such as
> >
> >    "John Smith works for the Apple Inc. in Cupertino, California."
> >
> >
> > If you want to search also for entities like "Bank", "Blog", "Consumer",
> > "Telephone", you need to also configure a KeywordLinkingEngine for dbpedia.
> > Part B of [3] provides more information on how to do that.
> >
> > But let me mention that the KeywordLinkingEngine is more useful when used in
> > combination with your own domain-specific thesaurus rather than a global
> > data set like dbpedia. When used with dbpedia you will also get a lot of
> > false positives.
> >
> > best
> > Rupert
> >
> > [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
> >
> 
> 
> 
