On 19/01/17 12:16, Jean-Marc Vanel wrote:
Yes Andy, the look ahead parser is definitely for a different method,
usable by applications as a fallback to be used in exception catching; for
durably incorrect sources like dbPedia.

About reading RDF/XML, I guess tolerant reading is not difficult?

:-)

Errors happen at different levels: you had a token-level error (a bad prefix name), which needs skipping to some recovery point. Other kinds of errors can be handled as additional grammar rules; these give nice error messages.

For RDF/XML, Jena uses a standard XML parser then has a grammar for RDF over that.

If the XML is broken, we are entirely dependent on the XML parser (Apache Xerces, not the forked one in the JDK). XML parsers tend to be written to be careful and correct, not to recover (think business process documents - any error is a problem). If a tag is wrong, they don't recover.

Errors in the RDF rules over the parsed XML are already handled - they tend to be reported as warnings.

Is it "not difficult"? Well, do you want to write an XML parser? :-) It's a fairly simple language - the hard part is dealing with the details of character sets and entities, and providing the standard API.

Jena uses SAX for streaming - recovering from tag errors is hard in SAX - e.g. the parser has seen the start <tag> ... then finds a mismatch with </othertag>. But the API is based on pairing start and end tags.
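
To make the SAX point concrete, here is a minimal sketch. It uses the SAX parser bundled with the JDK rather than the Apache Xerces build Jena uses, and the class name SaxMismatchDemo is made up for illustration. The handler gets the start-element callbacks, then the parse aborts at the mismatched end tag, and the API offers no call to resume from there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Jena.
final class SaxMismatchDemo {

    /** Parses the XML with the JDK's SAX parser and records the events seen. */
    static List<String> events(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                seen.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                seen.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Well-formedness error: the parse is over, there is nothing to resume.
            seen.add("fatal");
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // <b> is opened but </c> is found: start events arrive, then the parse fails.
        System.out.println(events("<a><b>text</c></a>"));
        // Well-formed input for comparison.
        System.out.println(events("<a><b/></a>"));
    }
}
```

Any recovery would have to happen below this API, inside the XML parser itself, which is the "do you want to write an XML parser?" problem.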


Aside, you can ask for a different format and hope that's not broken - I thought that DBpedia parsed all their files for correctness these days but it seems they don't.

Unicode x2013 is en-dash - the Turtle grammar goes back to XML's NameChar production, from which the SPARQL and Turtle rules were derived. IIRC it's excluded because it looks like a plain dash, hence XML security worries.
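
As an illustration of why the en-dash is rejected, here is a small sketch of the PN_CHARS character class used for prefixed names. The ranges are transcribed from the Turtle grammar as I remember it, so check them against the spec before relying on them; the class name TurtleNameChars is made up and this is not Jena's tokenizer code.

```java
// Hypothetical helper; ranges follow the PN_CHARS_BASE / PN_CHARS productions
// of the Turtle grammar as recalled, not copied from Jena.
final class TurtleNameChars {

    static boolean isPnCharsBase(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || in(cp, 0x00C0, 0x00D6) || in(cp, 0x00D8, 0x00F6) || in(cp, 0x00F8, 0x02FF)
            || in(cp, 0x0370, 0x037D) || in(cp, 0x037F, 0x1FFF) || in(cp, 0x200C, 0x200D)
            || in(cp, 0x2070, 0x218F) || in(cp, 0x2C00, 0x2FEF) || in(cp, 0x3001, 0xD7FF)
            || in(cp, 0xF900, 0xFDCF) || in(cp, 0xFDF0, 0xFFFD) || in(cp, 0x10000, 0xEFFFF);
    }

    static boolean isPnChars(int cp) {
        return isPnCharsBase(cp) || cp == '_' || cp == '-' || (cp >= '0' && cp <= '9')
            || cp == 0x00B7 || in(cp, 0x0300, 0x036F) || in(cp, 0x203F, 0x2040);
    }

    private static boolean in(int cp, int lo, int hi) {
        return cp >= lo && cp <= hi;
    }

    public static void main(String[] args) {
        System.out.println("hyphen-minus U+002D allowed: " + isPnChars('-'));
        System.out.println("en-dash      U+2013 allowed: " + isPnChars(0x2013));
    }
}
```

U+2013 falls in the gap between U+200D and U+203F, so it cannot appear anywhere in a prefixed name, while the ASCII hyphen can.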

        Andy




2017-01-19 13:10 GMT+01:00 Andy Seaborne <[email protected]>:



On 19/01/17 11:47, Jean-Marc Vanel wrote:

FIX typo

2017-01-19 12:45 GMT+01:00 Jean-Marc Vanel <[email protected]>:

There is, however, a possible improvement in the Jena parser.
It could skip over the faulty triple, and output a non-empty graph with
the correct ones.

Since it should also report the faulty input, I guess this would be in
another new method.

I guess also that writing a fault tolerant parser is not easy ...


Not for Turtle, especially when the basic token is broken (a prefix name -
they don't have simple delimiters like <>).

The code (TokenizerText.readSegment) may be able to deal with some cases
better.

N-Triples would be much easier because recovery is just scanning to the end
of the line, and the tokens of the language have delimiters.
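
A rough sketch of that line-based recovery, under loud assumptions: the class TolerantLineReader is hypothetical, and the regular expression is a deliberately crude stand-in for a real N-Triples statement parser (it is not Jena code and does not cover the full grammar). The point is only the control flow: one statement per line, so a bad line is reported and the next line is the recovery point.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
final class TolerantLineReader {

    // Crude approximation (assumption): subject, predicate, object, final dot.
    // Object is an IRI, a blank node, or a quoted literal with optional @lang / ^^<datatype>.
    private static final Pattern STATEMENT = Pattern.compile(
        "^\\s*(<[^<>\\s]*>|_:\\w+)\\s+<[^<>\\s]*>\\s+"
      + "(<[^<>\\s]*>|_:\\w+|\"(?:[^\"\\\\]|\\\\.)*\"(?:@[A-Za-z0-9-]+|\\^\\^<[^<>\\s]*>)?)"
      + "\\s*\\.\\s*$");

    /** Accepted statements, plus "lineNumber: text" for every rejected line. */
    static final class Result {
        final List<String> good = new ArrayList<>();
        final List<String> bad = new ArrayList<>();
    }

    static Result read(Reader in) throws IOException {
        Result result = new Result();
        BufferedReader reader = new BufferedReader(in);
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            String text = line.trim();
            if (text.isEmpty() || text.startsWith("#")) {
                continue; // blank line or comment
            }
            if (STATEMENT.matcher(line).matches()) {
                result.good.add(text);
            } else {
                // Report the faulty input; recovery is simply moving on to the next line.
                result.bad.add(lineNumber + ": " + text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String data = "<http://ex/s> <http://ex/p> \"ok\" .\n"
                    + "<http://ex/s> ex:\u2013bad <http://ex/o> .\n"
                    + "<http://ex/s> <http://ex/p> <http://ex/o> .\n";
        Result r = read(new StringReader(data));
        System.out.println("good: " + r.good.size() + ", bad: " + r.bad);
    }
}
```

Nothing here needs to read ahead and then reread, which is why the same idea does not carry over to Turtle, where a statement can span many lines.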

What to avoid is having to read ahead then reread to parse when in normal,
non-error mode.  That will slow down parsing measurably - tokenizing is on
the critical time path for throughput. For example, using JavaCC, which has
a more powerful (expressive) tokenizer, came out significantly slower (near
50% for N-triples IIRC).

        Andy



2017-01-19 11:23 GMT+01:00 Jean-Marc Vanel <[email protected]>:

Many thanks, Lorenz

Thanks for the accurate diagnosis and the bug report to Virtuoso.

I was aware of the issue;
I was testing from my semantic_forms application,
where the exception catching has gone wrong recently (and the Scala language
does not require catching exceptions).



2017-01-19 10:05 GMT+01:00 Lorenz B. <[email protected]>:

Hi,

can you clarify what doesn't work?

I tried your example and it would work, but I'm getting a parse
exception because DBpedia (resp. Virtuoso) still returns illegal data:

   Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Rome");
   System.out.println(g.size());


[line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 1863, col: 13] Failed to find a prefix name or keyword: –(8211;0x2013)
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
    at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:165)
    at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:108)
    at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:248)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:190)
    at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
    at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:89)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:179)
    at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:861)
    at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:667)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:212)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:105)
    at org.apache.jena.riot.RDFDataMgr.loadGraph(RDFDataMgr.java:346)




If there is no parsing error (e.g. http://dbpedia.org/resource/Mars works
for me), the result is as expected:


   Graph g = RDFDataMgr.loadGraph("http://dbpedia.org/resource/Mars");
   System.out.println(g.size());

   Output: 347


The issue was already reported by me, see [1], and [2]

[1] https://github.com/openlink/virtuoso-opensource/issues/567
[2] https://github.com/openlink/virtuoso-opensource/issues/569



Kind regards,
Lorenz

Reading a DBpedia resource with e.g. RDFDataMgr.loadGraph() does not
currently work (it used to work with Jena 3.1.1).
Apparently this is because of the 303 redirection.
Is there another call in the Jena API to handle redirections and accept
RDF MIME types?

wget --save-headers --header='Accept: application/rdf+xml' http://dbpedia.org/resource/Rome
--2017-01-19 09:30:08--  http://dbpedia.org/resource/Rome
Resolving dbpedia.org (dbpedia.org)… 194.109.129.58
Connecting to dbpedia.org (dbpedia.org)|194.109.129.58|:80… connected.
HTTP request sent, awaiting response… *303 See Other*
Location: http://dbpedia.org/data/Rome.xml [following]
--2017-01-19 09:30:08--  http://dbpedia.org/data/Rome.xml
Reusing existing connection to dbpedia.org:80.
HTTP request sent, awaiting response… 200 OK
Length: 1003627 (980K) [application/rdf+xml]
Saving to: «Rome.2»

less Rome.2
HTTP/1.1 200 OK
Date: Thu, 19 Jan 2017 08:30:08 GMT
Content-Type: application/rdf+xml; charset=UTF-8
Content-Length: 1003627
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/07.20.3217 (Linux) i686-generic-linux-glibc212-64 VDB

Expires: Thu, 26 Jan 2017 08:30:08 GMT
Link: <http://creativecommons.org/licenses/by-sa/3.0/>;rel="license",
 <http://dbpedia.org/data/Rome.n3>; rel="alternate"; type="text/n3"; title="Structured Descriptor Document (N3/Turtle format)",
 <http://dbpedia.org/data/Rome.json>; rel="alternate"; type="application/json"; title="Structured Descriptor Document (RDF/JSON format)",
 <http://dbpedia.org/data/Rome.atom>; rel="alternate"; type="application/atom+xml"; title="OData (Atom+Feed format)",
 <http://dbpedia.org/data/Rome.jsod>; rel="alternate"; type="application/odata+json"; title="OData (JSON format)",
 <http://dbpedia.org/page/Rome>; rel="alternate"; type="text/html"; title="XHTML+RDFa",
 <http://dbpedia.org/resource/Rome>; rel="http://xmlns.com/foaf/0.1/primaryTopic",
 <http://dbpedia.org/resource/Rome>; rev="describedby",
 <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Rome.xml>; rel="timegate"
X-SPARQL-default-graph: http://dbpedia.org
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding

Accept-Ranges: bytes

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...
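
For comparison with the wget run above, the same request (Accept header plus following the 303) can be sketched with the JDK's own HTTP client, with no Jena involved. This assumes Java 11 or later; the class name NegotiatedFetch and its two helper methods are made up for illustration, and the sketch only fetches the document, it does not parse it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical class; uses only java.net.http from the JDK.
final class NegotiatedFetch {

    /** A client that follows redirects such as the 303 See Other shown above. */
    static HttpClient client() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** A GET request that asks for one specific RDF MIME type. */
    static HttpRequest request(String url, String accept) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", accept)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpResponse<String> response = client().send(
                request("http://dbpedia.org/resource/Rome", "application/rdf+xml"),
                HttpResponse.BodyHandlers.ofString());
        // Final URI after redirection, status and negotiated content type.
        System.out.println(response.statusCode() + " " + response.uri());
        System.out.println(response.headers().firstValue("Content-Type").orElse("(none)"));
    }
}
```

If the final status is 200 and the content type is an RDF one, the redirect itself is not the problem and any failure is in the returned data.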

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center




--
Jean-Marc Vanel
Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52 <+33%206%2089%2016%2029%2052>
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui



