Yes Andy, the look-ahead parser definitely belongs in a separate method, one that applications can use as a fallback when they catch a parse exception, for persistently incorrect sources like DBpedia.
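To make the fallback idea concrete, here is a minimal sketch in plain Java. It does not use Jena: the class TolerantLineReader and all its methods are hypothetical names, and the regular expression accepts only a simplified, N-Triples-like subset (IRIs and plain literals, no escapes, blank nodes, language tags or datatypes). It only illustrates the shape: a strict single pass on the normal path, and a second, tolerant pass that skips to the end of the faulty line and reports it, run only after the strict pass has failed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Jena: strict parse first, tolerant parse only on failure. */
final class TolerantLineReader {

    // Simplified N-Triples-like line: <iri> <iri> (<iri> | "plain literal") .
    // Assumption: no blank nodes, language tags, datatypes or escapes.
    private static final Pattern TRIPLE_LINE = Pattern.compile(
            "<[^<>\\s]+>\\s+<[^<>\\s]+>\\s+(<[^<>\\s]+>|\"[^\"]*\")\\s*\\.");

    /** Accepted lines plus one message per rejected line. */
    static final class Result {
        final List<String> triples = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    /** Normal path: single pass, no read-ahead, throws on the first bad line. */
    static List<String> parseStrict(String input) {
        return scan(input, true).triples;
    }

    /** Fallback path: a bad line is reported and skipped; parsing resumes at the next line. */
    static Result parseTolerant(String input) {
        return scan(input, false);
    }

    /** Try the strict parser; re-read the input in tolerant mode only if it fails. */
    static Result load(String input) {
        try {
            Result ok = new Result();
            ok.triples.addAll(parseStrict(input));
            return ok;
        } catch (IllegalArgumentException e) {
            return parseTolerant(input);
        }
    }

    private static Result scan(String input, boolean failFast) {
        Result r = new Result();
        String[] lines = input.split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            if (TRIPLE_LINE.matcher(line).matches()) {
                r.triples.add(line);
            } else {
                String msg = "[line: " + (i + 1) + "] not a triple: " + line;
                if (failFast) {
                    throw new IllegalArgumentException(msg);
                }
                r.errors.add(msg); // report the faulty input, keep the rest
            }
        }
        return r;
    }
}
```

An application would call TolerantLineReader.load(text) and inspect the errors list to see which lines were dropped. The cost of reading the input twice is paid only on the error path, so the normal path stays a single pass.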
As for reading RDF/XML, I guess tolerant reading would not be difficult?

2017-01-19 13:10 GMT+01:00 Andy Seaborne:

> On 19/01/17 11:47, Jean-Marc Vanel wrote:
>
>> FIX typo
>>
>> 2017-01-19 12:45 GMT+01:00 Jean-Marc Vanel:
>>
>>> There is, however, a possible improvement in the Jena parser.
>>> It could skip over the faulty triple, and output a non-empty graph with
>>> the correct ones.
>>>
>>> Since it should also report the faulty input, I guess this would be in
>>> another new method.
>>>
>>> I guess also that writing a fault-tolerant parser is not easy ...
>>
>
> Not for Turtle, especially when the basic token is broken (prefix names:
> they don't have simple delimiters like <>).
>
> The code (TokenizerText.readSegment) may be able to deal with some cases
> better.
>
> N-Triples would be much easier: recovery is scanning to the end of the
> line, and the tokens of the language have delimiters.
>
> What to avoid is having to read ahead then reread to parse when in normal,
> non-error mode. That will slow down parsing measurably - tokenizing is on
> the critical time path for throughput. For example, using JavaCC, which has
> a more powerful (expressive) tokenizer, came out significantly slower (near
> 50% for N-Triples IIRC).
>
> Andy
>
>>> 2017-01-19 11:23 GMT+01:00 Jean-Marc Vanel:
>>>
>>>> Many thanks, Lorenz
>>>>
>>>> Thanks for the accurate diagnosis and the bug report to Virtuoso.
>>>>
>>>> I was aware of the issue;
>>>> I was testing from my semantic_forms application,
>>>> where the exception catching has gone wrong recently (and the Scala
>>>> language does not require exceptions to be caught).
>>>>
>>>> 2017-01-19 10:05 GMT+01:00 Lorenz B.:
>>>>
>>>>> Hi,
>>>>>
>>>>> can you clarify what doesn't work?
>>>>>
>>>>> I tried your example and it would work, but I'm getting a parse
>>>>> exception because DBpedia (resp.
>>>>> Virtuoso) still returns illegal data:
>>>>>
>>>>> Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Rome");
>>>>> System.out.println(g.size());
>>>>>
>>>>> [line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
>>>>> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
>>>>>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
>>>>>     at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:165)
>>>>>     at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:108)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:248)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:190)
>>>>>     at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:89)
>>>>>     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
>>>>>     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:179)
>>>>>     at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:861)
>>>>>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:667)
>>>>>     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:212)
>>>>>     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:105)
>>>>>     at org.apache.jena.riot.RDFDataMgr.loadGraph(RDFDataMgr.java:346)
>>>>>
>>>>> If there is no parsing error (e.g. http://dbpedia.org/resource/Mars works
>>>>> for me) the result is as expected:
>>>>>
>>>>> Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Mars");
>>>>> System.out.println(g.size());
>>>>>
>>>>> Output: 347
>>>>>
>>>>> The issue was already reported by me, see [1], and [2]
>>>>>
>>>>> [1] https://github.com/openlink/virtuoso-opensource/issues/567
>>>>> [2] https://github.com/openlink/virtuoso-opensource/issues/569
>>>>>
>>>>> Kind regards,
>>>>> Lorenz
>>>>>
>>>>>> Reading a dbPedia resource with e.g. RDFDataMgr.loadGraph() does not
>>>>>> currently work (it used to work with Jena 3.1.1).
>>>>>> Apparently this is because of the 303 redirection.
>>>>>> Is there another call in the Jena API to handle redirections and accept
>>>>>> RDF MIME types?
>>>>>>
>>>>>> wget --save-headers --header='Accept: application/rdf+xml' http://dbpedia.org/resource/Rome
>>>>>> --2017-01-19 09:30:08-- http://dbpedia.org/resource/Rome
>>>>>> Resolving dbpedia.org (dbpedia.org)… 194.109.129.58
>>>>>> Connecting to dbpedia.org (dbpedia.org)|194.109.129.58|:80… connected.
>>>>>> HTTP request sent, awaiting response… *303 See Other*
>>>>>> Location: http://dbpedia.org/data/Rome.xml [following]
>>>>>> --2017-01-19 09:30:08-- http://dbpedia.org/data/Rome.xml
>>>>>> Reusing existing connection to dbpedia.org:80.
>>>>>> HTTP request sent, awaiting response… 200 OK
>>>>>> Length: 1003627 (980K) [application/rdf+xml]
>>>>>> Saving to: 'Rome.2'
>>>>>>
>>>>>> less Rome.2
>>>>>> HTTP/1.1 200 OK
>>>>>> Date: Thu, 19 Jan 2017 08:30:08 GMT
>>>>>> Content-Type: application/rdf+xml; charset=UTF-8
>>>>>> Content-Length: 1003627
>>>>>> Connection: keep-alive
>>>>>> Vary: Accept-Encoding
>>>>>> Server: Virtuoso/07.20.3217 (Linux) i686-generic-linux-glibc212-64 VDB
>>>>>> Expires: Thu, 26 Jan 2017 08:30:08 GMT
>>>>>> Link: <http://creativecommons.org/licenses/by-sa/3.0/>;rel="license",
>>>>>>   <http://dbpedia.org/data/Rome.n3>; rel="alternate"; type="text/n3"; title="Structured Descriptor Document (N3/Turtle format)",
>>>>>>   <http://dbpedia.org/data/Rome.json>; rel="alternate"; type="application/json"; title="Structured Descriptor Document (RDF/JSON format)",
>>>>>>   <http://dbpedia.org/data/Rome.atom>; rel="alternate"; type="application/atom+xml"; title="OData (Atom+Feed format)",
>>>>>>   <http://dbpedia.org/data/Rome.jsod>; rel="alternate"; type="application/odata+json"; title="OData (JSON format)",
>>>>>>   <http://dbpedia.org/page/Rome>; rel="alternate"; type="text/html"; title="XHTML+RDFa",
>>>>>>   <http://dbpedia.org/resource/Rome>; rel="http://xmlns.com/foaf/0.1/primaryTopic",
>>>>>>   <http://dbpedia.org/resource/Rome>; rev="describedby",
>>>>>>   <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Rome.xml>; rel="timegate"
>>>>>> X-SPARQL-default-graph: http://dbpedia.org
>>>>>> Cache-Control: max-age=604800
>>>>>> Access-Control-Allow-Origin: *
>>>>>> Access-Control-Allow-Credentials: true
>>>>>> Access-Control-Allow-Methods: GET, POST, OPTIONS
>>>>>> Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
>>>>>> Accept-Ranges: bytes
>>>>>>
>>>>>> <?xml version="1.0" encoding="utf-8" ?>
>>>>>> <rdf:RDF
>>>>>> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>> ...
>>>>>>
>>>>> --
>>>>> Lorenz Bühmann
>>>>> AKSW group, University of Leipzig
>>>>> Group: http://aksw.org - semantic web research center
>>>>>
>>>> --
>>>> Jean-Marc Vanel
>>>> Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
>>>> Déductions SARL - Consulting, services, training,
>>>> Rule-based programming, Semantic Web
>>>> +33 (0)6 89 16 29 52
>>>> Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
>>>>

--
Jean-Marc Vanel
Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
