Hi,

Thank you! I will check out the latest version. I was using application/msword because I thought that was the right one. Could you please send me the correct MIME types for the pdf, txt, ppt, xls and odt formats?
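For reference, the media types registered with IANA for these five formats can be collected in a small lookup table. This is only a minimal sketch: the class name MediaTypes and the method forExtension are hypothetical, and whether Metaxa expects exactly these strings is an assumption (the reply quoted below gives application/vnd.ms-word for .doc, which differs from the registered application/msword).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Stanbol): file extension -> IANA media type. */
final class MediaTypes {
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("odt", "application/vnd.oasis.opendocument.text");
    }

    private MediaTypes() {}

    /** Returns the media type for an extension such as "pdf" or ".PDF", or null if unknown. */
    static String forExtension(String extension) {
        String key = extension.toLowerCase(Locale.ROOT);
        if (key.startsWith(".")) {
            key = key.substring(1);
        }
        return BY_EXTENSION.get(key);
    }
}
```

The returned string would then be passed as the Content-type request header; if the engine still extracts no text, the mapping it uses internally may differ from the registered names.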
Best,
Srecko

On Fri, Jan 13, 2012 at 1:34 PM, Walter Kasper <[email protected]> wrote:

> Hi,
>
> We fixed the problem with unresolved relative URLs from HTML documents. In
> the case of your Wikipedia page it came from an embedded rel-license
> microformat. If you are interested only in text extraction, you can also
> just disable the RDFa and Microformat extractors in the configuration for
> the HTML extraction.
>
> We also tested Word documents with your test sentence. Everything worked
> fine for us. Did you use the correct MIME type? The correct ones for Word
> documents are:
>
> doc format (<= Word 2003): application/vnd.ms-word
> docx format (Word 2007): application/vnd.openxmlformats-officedocument.wordprocessingml
>
> Best regards,
>
> Walter
>
> srecko joksimovic wrote:
>
>> Hi Walter,
>>
>> The Word document is nothing special, just one line of text:
>>
>> "John Smith works for the Apple Inc. in Cupertino, California."
>>
>> Rupert suggested this sentence in order to test text annotation. As I know
>> the result of annotating this string, I thought I would create a Word
>> document with the same content for test purposes.
>>
>> The error with your HTML page apparently arises from a bug in resolving
>> relative URLs in one of the HTML extractors. We will fix that.
>>
>> Does it mean that I can't annotate HTML pages at this moment, or does that
>> depend on the page?
>>
>> Best,
>> Srecko
>>
>> On Fri, Jan 13, 2012 at 9:51 AM, Walter Kasper <[email protected]> wrote:
>>
>>> Hi Srecko,
>>>
>>> I don't know what the problem with your Word document could have been.
>>> Could you send it to me for testing?
>>>
>>> The error with your HTML page apparently arises from a bug in resolving
>>> relative URLs in one of the HTML extractors. We will fix that.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> Srecko Joksimovic wrote:
>>>
>>>> Thank you Rupert!
>>>>
>>>> It is probably something that I missed.
>>>>
>>>> Best,
>>>> Srecko
>>>>
>>>> -----Original Message-----
>>>> From: Rupert Westenthaler [mailto:rupert.westenthaler@gmail.com]
>>>> Sent: Thursday, January 12, 2012 20:16
>>>> To: Srecko Joksimovic; [email protected]
>>>> Cc: stanbol-dev@incubator.apache.org
>>>> Subject: Re: Annotating using DBPedia ontology
>>>>
>>>> Hi Srecko
>>>>
>>>> It seems that both cases are related to the Metaxa Engine. My knowledge
>>>> about the libs used by this engine to extract the textual content is very
>>>> limited, so I might not be the right person to look into that.
>>>>
>>>> In the first example I think Metaxa was not able to extract the text from
>>>> the Word document, because the only plainTextContent triple noted is
>>>>
>>>> <j.0:plainTextContent>Microsoft Word-Dokument
>>>> srecko</j.0:plainTextContent>
>>>>
>>>> The second example looks like an issue within the RDF metadata generation
>>>> in Aperture.
>>>>
>>>> I sent this reply also directly to Walter Kasper. He is the one who
>>>> contributed this engine and should be able to provide more information.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> On 12.01.2012, at 18:40, srecko joksimovic wrote:
>>>>
>>>>> Hi Rupert,
>>>>>
>>>>> I have another question, and I will finish soon.
>>>>>
>>>>> I tried to annotate a PDF document, and I didn't get the result I
>>>>> expected. Then I put the string you sent to me,
>>>>>
>>>>> "John Smith works for the Apple Inc. in Cupertino, California."
>>>>>
>>>>> in an MS Word document, and this is the result I got:
>>>>>
>>>>> <rdf:RDF
>>>>>     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>     xmlns:j.0="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
>>>>>     xmlns:j.1="http://purl.org/dc/terms/"
>>>>>     xmlns:j.2="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"
>>>>>     xmlns:j.3="http://fise.iks-project.eu/ontology/">
>>>>>   <rdf:Description rdf:about="urn:enhancement-55016818-eb97-7b98-521a-422e3742173b">
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</j.1:creator>
>>>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.288Z</j.1:created>
>>>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>>>     <j.1:language>fr</j.1:language>
>>>>>   </rdf:Description>
>>>>>   <rdf:Description rdf:about="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f">
>>>>>     <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument"/>
>>>>>     <j.0:plainTextContent>Microsoft Word-Dokument
>>>>> srecko</j.0:plainTextContent>
>>>>>   </rdf:Description>
>>>>>   <rdf:Description rdf:about="urn:enhancement-0644a1ed-f1d8-334d-d4e9-690a0446cba8">
>>>>>     <j.3:confidence rdf:datatype="http://www.w3.org/2001/XMLSchema#double">1.0</j.3:confidence>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>>>>>     <j.1:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine</j.1:creator>
>>>>>     <j.1:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-01-12T17:34:20.273Z</j.1:created>
>>>>>     <j.3:extracted-from rdf:resource="urn:content-item-sha1-835c8a5397d9b376a268b7bb5d3c8b4ab7e8b81f"/>
>>>>>     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>>>>>   </rdf:Description>
>>>>> </rdf:RDF>
>>>>>
>>>>> and this is the code:
>>>>>
>>>>> public List<String> Annotate(byte[] _stream_to_annotate,
>>>>>         ServiceUtils.MIMETypes _content_type, String _encoding)
>>>>> {
>>>>>     List<String> _return_list = new ArrayList<String>();
>>>>>     try
>>>>>     {
>>>>>         URL url = new URL(ServiceUtils.SERVICE_URL);
>>>>>         HttpURLConnection con = (HttpURLConnection)url.openConnection();
>>>>>         con.setDoOutput(true);
>>>>>         con.setRequestMethod("POST");
>>>>>         con.setRequestProperty("Accept", "application/rdf+xml");
>>>>>         con.setRequestProperty("Content-type", _content_type.getValue());
>>>>>         java.io.OutputStream out = con.getOutputStream();
>>>>>         IOUtils.write(_stream_to_annotate, out);
>>>>>         IOUtils.closeQuietly(out);
>>>>>         con.connect(); //send the request
>>>>>         if(con.getResponseCode() > 299)
>>>>>         {
>>>>>             java.io.InputStream errorStream = con.getErrorStream();
>>>>>             if(errorStream != null)
>>>>>             {
>>>>>                 String errorMessage = IOUtils.toString(errorStream);
>>>>>                 IOUtils.closeQuietly(errorStream);
>>>>>             }
>>>>>             else
>>>>>             {
>>>>>                 //no error data
>>>>>                 //write default error message with the status code
>>>>>             }
>>>>>         }
>>>>>         else
>>>>>         {
>>>>>             Model model = ModelFactory.createDefaultModel();
>>>>>             java.io.InputStream enhancementResults = con.getInputStream();
>>>>>             model.read(enhancementResults, null);
>>>>>             String queryStringForGraph = "PREFIX t: <http://fise.iks-project.eu/ontology/> " +
>>>>>                 "SELECT ?label WHERE {?alias t:entity-reference ?label}";
>>>>>             Query query = QueryFactory.create(queryStringForGraph);
>>>>>             QueryExecution qe = QueryExecutionFactory.create(query, model);
>>>>>             ResultSet results = qe.execSelect();
>>>>>             while(results.hasNext())
>>>>>             {
>>>>>                 _return_list.add(results.next().toString());
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>     catch(Exception ex)
>>>>>     {
>>>>>         System.out.println(ex.getMessage());
>>>>>     }
>>>>>     return _return_list;
>>>>> }
>>>>>
>>>>> On Thu, Jan 12, 2012 at 8:32 AM, srecko joksimovic <[email protected]> wrote:
>>>>>
>>>>> Hi Rupert,
>>>>>
>>>>> Thank you for the answer. I've probably missed that.
>>>>>
>>>>> Best,
>>>>> Srecko
>>>>>
>>>>> On Thu, Jan 12, 2012 at 6:12 AM, Rupert Westenthaler <[email protected]> wrote:
>>>>>
>>>>> Hi Srecko
>>>>>
>>>>> I think the last time I directly used this API was about 3-4 years ago,
>>>>> but after a look at the http client tutorial [1] I think the reason for
>>>>> your problem is that you do not execute the GetMethod.
>>>>>
>>>>> Based on this tutorial the code should look like
>>>>>
>>>>> // Create an instance of HttpClient.
>>>>> HttpClient client = new HttpClient();
>>>>> GetMethod get = new GetMethod(url);
>>>>> try {
>>>>>     // Execute the method.
>>>>>     int statusCode = client.executeMethod(get);
>>>>>     if (statusCode != HttpStatus.SC_OK) {
>>>>>         //handle the error
>>>>>     }
>>>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>>>     //read the data of the stream
>>>>> } finally {
>>>>>     // Release the connection once the stream has been read.
>>>>>     get.releaseConnection();
>>>>> }
>>>>>
>>>>> In addition you should not use a Reader if you want to read byte-oriented
>>>>> data from the input stream.
>>>>>
>>>>> hope this helps
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://hc.apache.org/httpclient-3.x/tutorial.html
>>>>>
>>>>> On 11.01.2012, at 22:34, Srecko Joksimovic wrote:
>>>>>
>>>>>> That's it. Thank you!
>>>>>>
>>>>>> I have already configured KeywordLinkingEngine when I used my own
>>>>>> ontology. I think I'm familiar with that and I will try that option too.
>>>>>>
>>>>>> In the meantime I found another interesting problem. I tried to annotate
>>>>>> a document and a web page.
>>>>>> With the web page, I tried IOUtils.write(byte[], out) and I had to
>>>>>> convert the URL to byte[]:
>>>>>>
>>>>>> public static byte[] GetBytesFromURL(String _url) throws IOException
>>>>>> {
>>>>>>     GetMethod get = new GetMethod(_url);
>>>>>>     InputStream t_is = get.getResponseBodyAsStream();
>>>>>>     byte[] buffer = new byte[1024];
>>>>>>     int count = -1;
>>>>>>     Reader t_url_reader = new BufferedReader(new InputStreamReader(t_is));
>>>>>>     byte[] t_bytes = IOUtils.toByteArray(t_url_reader, "UTF-8");
>>>>>>
>>>>>>     return t_bytes;
>>>>>> }
>>>>>>
>>>>>> But the problem is that I'm getting null for the InputStream.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Best,
>>>>>> Srecko
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Rupert Westenthaler [mailto:rupert.westenthaler@gmail.com]
>>>>>> Sent: Wednesday, January 11, 2012 22:08
>>>>>> To: Srecko Joksimovic
>>>>>> Cc: stanbol-dev@incubator.apache.org
>>>>>> Subject: Re: Annotating using DBPedia ontology
>>>>>>
>>>>>> On 11.01.2012, at 21:41, Srecko Joksimovic wrote:
>>>>>>
>>>>>>> Hi Rupert,
>>>>>>>
>>>>>>> When I load localhost:8080/engines it says this:
>>>>>>>
>>>>>>> There are currently 5 active engines.
>>>>>>> org.apache.stanbol.enhancer.engines.metaxa.MetaxaEngine
>>>>>>> org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine
>>>>>>> org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine
>>>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>>> org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine
>>>>>>>
>>>>>>> Maybe this could tell you something?
>>>>>>
>>>>>> These are exactly the 5 engines that are expected to run with the default
>>>>>> configuration. Based on this, the Stanbol Enhancer should just work fine.
>>>>>>
>>>>>> After looking at the text you enhanced, I noticed however that it does
>>>>>> not mention any named entities such as Persons, Organizations and Places.
>>>>>> So I checked it with my local Stanbol version, and no entities were
>>>>>> detected there either.
>>>>>>
>>>>>> So to check if Stanbol works as expected you should try to use another
>>>>>> text that mentions some Named Entities, such as
>>>>>>
>>>>>> "John Smith works for the Apple Inc. in Cupertino, California."
>>>>>>
>>>>>> If you also want to search for entities like "Bank", "Blog", "Consumer",
>>>>>> "Telephone", you need to also configure a KeywordLinkingEngine for
>>>>>> dbpedia. Part B of [3] provides more information on how to do that.
>>>>>>
>>>>>> But let me mention that the KeywordLinkingEngine is more useful if used
>>>>>> in combination with your own domain-specific thesaurus rather than a
>>>>>> global data set like dbpedia. When used with dbpedia you will also get a
>>>>>> lot of false positives.
>>>>>>
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>> [3] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
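
The two points Rupert makes earlier in the thread (the request has to be executed before the body is read, and byte-oriented data should not go through a Reader) can be sketched with the JDK alone. This is a minimal sketch under stated assumptions, not the code used in the thread: it swaps HttpClient 3.x for java.net.HttpURLConnection, and the class UrlBytes and its methods toBytes and fetch are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: downloads a URL as raw bytes without any character decoding. */
final class UrlBytes {
    private UrlBytes() {}

    /** Copies the stream into a byte array; no Reader is involved, so bytes stay untouched. */
    static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    /** Executes a GET request and returns the response body as bytes. */
    static byte[] fetch(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        try {
            // getResponseCode() actually sends the request; only then is a body available.
            int status = con.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            try (InputStream in = con.getInputStream()) {
                return toBytes(in);
            }
        } finally {
            con.disconnect();
        }
    }
}
```

The byte array returned by fetch could then be passed to a method such as Annotate from the thread, with the Content-type header set to text/html for a web page.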
