Hi Walter,

The Word document is nothing special, just one line of text:
"John Smith works for the Apple Inc. in Cupertino, California."

Rupert suggested this sentence for testing text annotation. Since I already know the result of annotating this string, I thought I would create a Word document with the same content for test purposes.

You wrote that the error with my HTML page apparently arises from a bug in resolving relative URLs in one of the HTML extractors, and that you will fix it. Does that mean I can't annotate any HTML page at the moment, or does it depend on the particular page?

Best,
Srecko

On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:
> Hi Srecko,
>
> I don't know what the problem with your Word document could have been.
> Could you send it to me for testing?
>
> The error with your HTML page apparently arises from a bug in resolving
> relative URLs in one of the HTML extractors. We will fix that.
>
> Best regards,
>
> Walter
>
>
> Srecko Joksimovic wrote:
>
>> Thank you Rupert!
>>
>> It is probably something that I missed.
>>
>> Best,
>> Srecko
>>
>> -----Original Message-----
>> From: Rupert Westenthaler [mailto:[email protected]]
>> Sent: Thursday, January 12, 2012 20:16
>> To: Srecko Joksimovic; [email protected]
>> Cc: [email protected]
>> Subject: Re: Annotating using DBPedia ontology
>>
>> Hi Srecko
>>
>> It seems that both cases are related to the Metaxa engine. My knowledge
>> about the libs this engine uses to extract the textual content is very
>> limited, so I might not be the right person to look into that.
>>
>> In the first example I think Metaxa was not able to extract the text from
>> the Word document, because the only plainTextContent triple noted is
>>
>> <j.0:plainTextContent>Microsoft Word-Dokument
>> srecko</j.0:plainTextContent>
>>
>> The second example looks like an issue within the RDF metadata generation
>> in Aperture.
>>
>> I also sent this reply directly to Walter Kasper. He is the one who
>> contributed this engine and should be able to provide more information.
>>
>> best
>> Rupert
>>
>> On 12.01.2012, at 18:40, srecko joksimovic wrote:
>>
>>> Hi Rupert,
>>>
>>> I have another question, and I will finish soon.
>>>
>>> I tried to annotate a PDF document, and I didn't get the result I
>>> expected. Then I put the string you sent me,
>>> "John Smith works for the Apple Inc. in Cupertino, California.",
>>> into an MS Word document, and this is the result I got:
>>>
>>> <rdf:RDF
>>>   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>   xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
>>>   xmlns:j.1="http://purl.org/dc/terms/"
>>>   xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
>>>   xmlns:j.3="http://fise.iks-project.eu/ontology/">
>>>   <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
>>>     <j.3:extracted-from
>>>       rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>     <j.1:language>fr</j.1:language>
>>>   </rdf:Description>
>>>   <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
>>>     <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
>>>     <j.0:plainTextContent>Microsoft Word-Dokument
>>> srecko</j.0:plainTextContent>
>>>   </rdf:Description>
>>>   <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
>>>     <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>   </rdf:Description>
>>> </rdf:RDF>
>>>
>>> and this is the code:
>>>
>>> public List<String> Annotate(byte[] _stream_to_annotate,
>>>         ServiceUtils.MIMETypes _content_type, String _encoding)
>>> {
>>>     List<String> _return_list = new ArrayList<String>();
>>>     try
>>>     {
>>>         URL url = new URL(ServiceUtils.SERVICE_URL);
>>>         HttpURLConnection con = (HttpURLConnection) url.openConnection();
>>>         con.setDoOutput(true);
>>>         con.setRequestMethod("POST");
>>>         con.setRequestProperty("Accept", "application/rdf+xml");
>>>         con.setRequestProperty("Content-type", _content_type.getValue());
>>>
>>>         java.io.OutputStream out = con.getOutputStream();
>>>         IOUtils.write(_stream_to_annotate, out);
>>>         IOUtils.closeQuietly(out);
>>>
>>>         con.connect(); // send the request
>>>
>>>         if (con.getResponseCode() > 299)
>>>         {
>>>             java.io.InputStream errorStream = con.getErrorStream();
>>>             if (errorStream != null)
>>>             {
>>>                 String errorMessage = IOUtils.toString(errorStream);
>>>                 IOUtils.closeQuietly(errorStream);
>>>             }
>>>             else
>>>             {
>>>                 // no error data:
>>>                 // write a default error message with the status code
>>>             }
>>>         }
>>>         else
>>>         {
>>>             Model model = ModelFactory.createDefaultModel();
>>>             java.io.InputStream enhancementResults = con.getInputStream();
>>>             model.read(enhancementResults, null);
>>>             String queryStringForGraph =
>>>                 "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
>>>                 "SELECT ?label WHERE {?alias t:entity-reference ?label}";
>>>             Query query = QueryFactory.create(queryStringForGraph);
>>>             QueryExecution qe = QueryExecutionFactory.create(query, model);
>>>
>>>             ResultSet results = qe.execSelect();
>>>             while (results.hasNext())
>>>             {
>>>                 _return_list.add(results.next().toString());
>>>             }
>>>         }
>>>     }
>>>     catch (Exception ex)
>>>     {
>>>         System.out.println(ex.getMessage());
>>>     }
>>>     return _return_list;
>>> }
>>>
>>> On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic
>>> <[email protected]> wrote:
>>>
>>> Hi Rupert,
>>>
>>> Thank you for the answer. I've probably missed that.
>>>
>>> Best,
>>> Srecko
>>>
>>>
>>> On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>>
>>> Hi Srecko
>>>
>>> I think the last time I directly used this API was about 3-4 years ago,
>>> but after a look at the HttpClient tutorial [1] I think the reason for
>>> your problem is that you do not execute the GetMethod.
>>>
>>> Based on this tutorial the code should look like:
>>>
>>> // Create an instance of HttpClient.
>>> HttpClient client = new HttpClient();
>>> GetMethod get = new GetMethod(url);
>>> try {
>>>     // Execute the method.
>>>     int statusCode = client.executeMethod(get);
>>>     if (statusCode != HttpStatus.SC_OK) {
>>>         // handle the error
>>>     }
>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>     // read the data of the stream
>>> }
>>>
>>> In addition, you should not use a Reader if you want to read
>>> byte-oriented data from the input stream.
>>>
>>> hope this helps
>>> best
>>> Rupert
>>>
>>> [1] http://hc.apache.org/httpclient-3.x/tutorial.html
>>>
>>> On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
>>>
>>>> That's it. Thank you!
>>>>
>>>> I have already configured the KeywordLinkingEngine when I used my own
>>>> ontology. I think I'm familiar with that and I will try that option too.
>>>>
>>>> In the meantime I found another interesting problem. I tried to
>>>> annotate a document and a web page. With the web page, I tried
>>>> IOUtils.write(byte[], out) and I had to convert the URL to byte[]:
>>>>
>>>> public static byte[] GetBytesFromURL(String _url) throws IOException
>>>> {
>>>>     GetMethod get = new GetMethod(_url);
>>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>>     byte[] buffer = new byte[1024];
>>>>     int count = -1;
>>>>     Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
>>>>     byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
>>>>
>>>>     return t_bytes;
>>>> }
>>>>
>>>> But the problem is that I'm getting null for the InputStream.
>>>>
>>>> Any ideas?
>>>>
>>>> Best,
>>>> Srecko
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Rupert Westenthaler [mailto:[email protected]]
>>>> Sent: Wednesday, January 11, 2012 22:08
>>>> To: Srecko Joksimovic
>>>> Cc: [email protected]
>>>> Subject: Re: Annotating using DBPedia ontology
>>>>
>>>>
>>>> On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
>>>>
>>>>> Hi Rupert,
>>>>>
>>>>> When I load localhost:8080/engines it says this:
>>>>>
>>>>> There are currently 5 active engines.
>>>>> org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
>>>>> org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
>>>>> org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>
>>>>> Maybe this could tell you something?
>>>>
>>>> These are exactly the 5 engines that are expected to run with the
>>>> default configuration. Based on this, the Stanbol Enhancer should work
>>>> just fine.
>>>>
>>>> After looking at the text you enhanced, I noticed however that it does
>>>> not mention any named entities such as persons, organizations and
>>>> places. I checked it with my local Stanbol version and also did not
>>>> get any detected entities.
>>>>
>>>> So to check whether Stanbol works as expected, you should try another
>>>> text that mentions some named entities, such as
>>>>
>>>> "John Smith works for the Apple Inc. in Cupertino, California."
>>>>
>>>> If you also want to search for entities like "Bank", "Blog",
>>>> "Consumer", "Telephone" ... you need to also configure a
>>>> KeywordLinkingEngine for dbpedia.
>>>> Part B of [3] provides more information on how to do that.
>>>>
>>>> But let me mention that the KeywordLinkingEngine is more useful when
>>>> used in combination with your own domain-specific thesaurus rather
>>>> than a global data set like dbpedia. When used with dbpedia you will
>>>> also get a lot of false positives.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
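[Editor's note] Rupert's two points in the thread above (the GetMethod is never executed, and a Reader must not be used for byte-oriented data) can be sketched in a small self-contained helper. This is an illustrative sketch, not code from the thread: it uses java.net.HttpURLConnection from the standard library instead of the commons-httpclient GetMethod, and the class and method names (UrlBytes, toByteArray, getBytesFromUrl) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlBytes {

    // Read an entire stream into a byte[] WITHOUT wrapping it in a Reader.
    // A Reader decodes characters and can corrupt binary content such as
    // Word or PDF documents.
    public static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    // Fetch a URL as raw bytes. Unlike the GetBytesFromURL in the thread,
    // the request is actually executed (getResponseCode() forces it)
    // before the response stream is read, so the stream is never null.
    public static byte[] getBytesFromUrl(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status: " + con.getResponseCode());
        }
        InputStream in = con.getInputStream();
        try {
            return toByteArray(in);
        } finally {
            in.close();
        }
    }
}
```

The resulting byte[] can then be passed unchanged to the Annotate method shown earlier in the thread, since the enhancer endpoint receives the document body as raw bytes.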
