Yes Andy, the look-ahead parser definitely belongs in a separate method,
usable by applications as a fallback when catching exceptions, for
persistently incorrect sources like DBpedia.
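
To make the fallback idea concrete, here is a minimal sketch in plain Java,
not Jena code. The names FallbackParsing, parseWithFallback, strictInts and
tolerantInts are hypothetical, and a comma-separated list of integers stands
in for RDF. The point is only the control flow: the strict parser runs alone
on the normal path, and the tolerant one is called only after the strict one
has thrown. As far as I know RiotException is a RuntimeException, so the
same catch shape should apply, but treat that as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: strict parse first, tolerant parse only on failure. */
final class FallbackParsing {

    /** Runs the strict parser; only if it throws, re-parses with the tolerant one. */
    static <T> T parseWithFallback(String input,
                                   Function<String, T> strict,
                                   Function<String, T> tolerant) {
        try {
            return strict.apply(input);          // normal path: one pass, no look-ahead
        } catch (RuntimeException e) {
            // Error path only: report the problem, then retry tolerantly.
            System.err.println("strict parse failed, falling back: " + e.getMessage());
            return tolerant.apply(input);
        }
    }

    /** Stand-in strict parser: fails on the first item that is not an integer. */
    static List<Integer> strictInts(String s) {
        List<Integer> out = new ArrayList<>();
        for (String part : s.split(",")) {
            out.add(Integer.valueOf(part.trim())); // throws NumberFormatException
        }
        return out;
    }

    /** Stand-in tolerant parser: skips faulty items and keeps the correct ones. */
    static List<Integer> tolerantInts(String s) {
        List<Integer> out = new ArrayList<>();
        for (String part : s.split(",")) {
            try {
                out.add(Integer.valueOf(part.trim()));
            } catch (NumberFormatException e) {
                // skip the faulty item
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseWithFallback("1, 2, x, 4",
                FallbackParsing::strictInts, FallbackParsing::tolerantInts));
    }
}
```

The cost of the second pass is paid only for broken input, which is the
property I would want from a separate tolerant method.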

As for reading RDF/XML, I guess tolerant reading is not difficult?



2017-01-19 13:10 GMT+01:00 Andy Seaborne <[email protected]>:

>
>
> On 19/01/17 11:47, Jean-Marc Vanel wrote:
>
>> FIX typo
>>
>> 2017-01-19 12:45 GMT+01:00 Jean-Marc Vanel <[email protected]>:
>>
>>> There is, however, a possible improvement in the Jena parser.
>>> It could skip over the faulty triple and output a non-empty graph with
>>> the correct ones.
>>>
>>> Since it should also report the faulty input, I guess this would go in
>>> a new, separate method.
>>>
>>> I guess also that writing a fault-tolerant parser is not easy ...
>>>
>>
> Not for Turtle, especially when the basic token is broken (prefixed names:
> they don't have simple delimiters like <>).
>
> The code (TokenizerText.readSegment) may be able to deal with some cases
> better.
>
> N-Triples would be much easier: recovery is scanning to the end of the line,
> and the tokens of the language have delimiters.
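
The end-of-line recovery described here can be sketched in a few lines of
plain Java. This is an illustration only, not Jena code: the class
TolerantLineReader and its Result type are hypothetical, and the regular
expression is a deliberately simplified shape check, not the N-Triples
grammar. A line that fails the check is recorded by number and skipped, and
reading resumes at the next line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of line-level error recovery for N-Triples-like input. */
final class TolerantLineReader {

    /** Accepted lines plus the 1-based numbers of the lines that were skipped. */
    static final class Result {
        final List<String> accepted = new ArrayList<>();
        final List<Integer> skippedLines = new ArrayList<>();
    }

    // Simplified shape check (an assumption, NOT the real grammar):
    //   <iri> <iri> (<iri> | "literal") .
    private static final Pattern SIMPLE_TRIPLE = Pattern.compile(
            "^<[^<>\\s]*>\\s+<[^<>\\s]*>\\s+(<[^<>\\s]*>|\"[^\"]*\")\\s*\\.\\s*$");

    static Result read(Iterable<String> lines) {
        Result result = new Result();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or comment
            }
            if (SIMPLE_TRIPLE.matcher(trimmed).matches()) {
                result.accepted.add(trimmed);
            } else {
                // Recovery: a statement ends with its line, so record the
                // faulty line and carry on with the next one.
                result.skippedLines.add(lineNo);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = read(List.of(
                "<http://example.org/s> <http://example.org/p> <http://example.org/o> .",
                "<http://example.org/s> ex\u2013bad <http://example.org/o> .",
                "<http://example.org/s> <http://example.org/p> \"Rome\" ."));
        System.out.println("accepted: " + r.accepted.size()
                + ", skipped lines: " + r.skippedLines);
    }
}
```

Reporting the skipped line numbers covers the "also report the faulty input"
requirement without any look-ahead on the normal path.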
>
> What to avoid is having to read ahead and then reread to parse when in normal,
> non-error mode. That will slow down parsing measurably: tokenizing is on
> the critical path for throughput. For example, using JavaCC, which has
> a more powerful (expressive) tokenizer, came out significantly slower (nearly
> 50% for N-Triples, IIRC).
>
>         Andy
>
>
>>>
>>> 2017-01-19 11:23 GMT+01:00 Jean-Marc Vanel <[email protected]>:
>>>
>>>> Many thanks, Lorenz
>>>>
>>>> Thanks for the accurate diagnosis and the bug report to Virtuoso.
>>>>
>>>> I was aware of the issue;
>>>> I was testing from my semantic_forms application,
>>>> where the exception catching has gone wrong recently (and the Scala language
>>>> does not require catching exceptions).
>>>>
>>>>
>>>>
>>>> 2017-01-19 10:05 GMT+01:00 Lorenz B. <[email protected]>:
>>>>
>>>> Hi,
>>>>>
>>>>> can you clarify what doesn't work?
>>>>>
>>>>> I tried your example and it would work, but I'm getting a parse
>>>>> exception because DBpedia (resp. Virtuoso) still returns illegal data:
>>>>>
>>>>>    Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Rome");
>>>>>    System.out.println(g.size());
>>>>>
>>>>>
>>>>> [line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
>>>>> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
>>>>>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
>>>>>     at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:165)
>>>>>     at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:108)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:248)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:190)
>>>>>     at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
>>>>>     at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:89)
>>>>>     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
>>>>>     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:179)
>>>>>     at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:861)
>>>>>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:667)
>>>>>     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:212)
>>>>>     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:105)
>>>>>     at org.apache.jena.riot.RDFDataMgr.loadGraph(RDFDataMgr.java:346)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> If there is no parsing error (e.g. http://dbpedia.org/resource/Mars works
>>>>> for me), the result is as expected:
>>>>>
>>>>>
>>>>>    Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Mars");
>>>>>    System.out.println(g.size());
>>>>>
>>>>>    Output: 347
>>>>>
>>>>>
>>>>> I have already reported the issue, see [1] and [2].
>>>>>
>>>>> [1] https://github.com/openlink/virtuoso-opensource/issues/567
>>>>> [2] https://github.com/openlink/virtuoso-opensource/issues/569
>>>>>
>>>>>
>>>>>
>>>>> Kind regards,
>>>>> Lorenz
>>>>>
>>>>>> Reading a DBpedia resource with e.g. RDFDataMgr.loadGraph() does not
>>>>>> currently work (it used to work with Jena 3.1.1).
>>>>>> Apparently this is because of the 303 redirection.
>>>>>> Is there another call in the Jena API that handles redirections and
>>>>>> accepts RDF MIME types?
>>>>>>
>>>>>> wget --save-headers --header='Accept: application/rdf+xml'
>>>>>> http://dbpedia.org/resource/Rome
>>>>>> --2017-01-19 09:30:08--  http://dbpedia.org/resource/Rome
>>>>>> Resolving dbpedia.org (dbpedia.org)… 194.109.129.58
>>>>>> Connecting to dbpedia.org (dbpedia.org)|194.109.129.58|:80… connected.
>>>>>> HTTP request sent, awaiting response… *303 See Other*
>>>>>> Location: http://dbpedia.org/data/Rome.xml [following]
>>>>>> --2017-01-19 09:30:08--  http://dbpedia.org/data/Rome.xml
>>>>>> Reusing existing connection to dbpedia.org:80.
>>>>>> HTTP request sent, awaiting response… 200 OK
>>>>>> Length: 1003627 (980K) [application/rdf+xml]
>>>>>> Saving to: 'Rome.2'
>>>>>> less Rome.2
>>>>>> HTTP/1.1 200 OK
>>>>>> Date: Thu, 19 Jan 2017 08:30:08 GMT
>>>>>> Content-Type: application/rdf+xml; charset=UTF-8
>>>>>> Content-Length: 1003627
>>>>>> Connection: keep-alive
>>>>>> Vary: Accept-Encoding
>>>>>> Server: Virtuoso/07.20.3217 (Linux) i686-generic-linux-glibc212-64 VDB
>>>>>>
>>>>>
>>>>>> Expires: Thu, 26 Jan 2017 08:30:08 GMT
>>>>>> Link: <http://creativecommons.org/licenses/by-sa/3.0/>;rel="license",
>>>>>>  <http://dbpedia.org/data/Rome.n3>; rel="alternate"; type="text/n3"; title="Structured Descriptor Document (N3/Turtle format)",
>>>>>>  <http://dbpedia.org/data/Rome.json>; rel="alternate"; type="application/json"; title="Structured Descriptor Document (RDF/JSON format)",
>>>>>>  <http://dbpedia.org/data/Rome.atom>; rel="alternate"; type="application/atom+xml"; title="OData (Atom+Feed format)",
>>>>>>  <http://dbpedia.org/data/Rome.jsod>; rel="alternate"; type="application/odata+json"; title="OData (JSON format)",
>>>>>>  <http://dbpedia.org/page/Rome>; rel="alternate"; type="text/html"; title="XHTML+RDFa",
>>>>>>  <http://dbpedia.org/resource/Rome>; rel="http://xmlns.com/foaf/0.1/primaryTopic",
>>>>>>  <http://dbpedia.org/resource/Rome>; rev="describedby",
>>>>>>  <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Rome.xml>; rel="timegate"
>>>>>> X-SPARQL-default-graph: http://dbpedia.org
>>>>>> Cache-Control: max-age=604800
>>>>>> Access-Control-Allow-Origin: *
>>>>>> Access-Control-Allow-Credentials: true
>>>>>> Access-Control-Allow-Methods: GET, POST, OPTIONS
>>>>>> Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
>>>>>> Accept-Ranges: bytes
>>>>>>
>>>>>> <?xml version="1.0" encoding="utf-8" ?>
>>>>>> <rdf:RDF
>>>>>>         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>> ...
>>>>>>
>>>>>> --
>>>>> Lorenz Bühmann
>>>>> AKSW group, University of Leipzig
>>>>> Group: http://aksw.org - semantic web research center
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Jean-Marc Vanel
>>>> Profil: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
>>>> Déductions SARL - Consulting, services, training,
>>>> Rule-based programming, Semantic Web
>>>> +33 (0)6 89 16 29 52
>>>> Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>


-- 
Jean-Marc Vanel
Profil:
http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
