Hi Roman,

> Hello.
> I have a lot of errors when I want to load DBpedia dataset using isql, the 
> command:
> ld_dir('/workingDir/btc2014_unzipped/01', 'data.nq-*', 'http://fake.org');
> 
> Example error:
> 
>  22007 XM003: XML parser detected an error:     ERROR  : Tag nesting
>  error: name 'img' of end tag does not match the name 'p' of start tag
>  at line 4 column 432 at line 4 column 438 of source text
>  04/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#";></img></p>
>  ----------------------------------------------------------------------^
> 
> Ok, let's find the line where the error occured (I put a line break, so it is 
> easier to see):
> 
> <http://core-project.kmi.open.ac.uk/data-description> 
> <http://purl.org/rss/1.0/modules/content/encoded> "<h2 
> xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"; 
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";>What data are 
> exposed</h2>\n<p xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"; 
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";>The CORE project exposes data about the 
> aggregated content. The following schema shows the kind of metadata CORE holds 
> about each resource. </p>\n<h2 xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"; 
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";>Data Schema</h2>\n<p 
> xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";></img></p>
>     \n<h2 xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"; 
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";>Data License</h2>\n<p 
> xmlns=\"http://www.w3.org/1999/xhtml\"; 
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"; 
> xmlns:dc=\"http://purl.org/dc/terms/\"; 
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"; xmlns:og=\"http://ogp.me/ns#\"; 
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"; 
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"; 
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"; 
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"; 
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\";>All data from CORE (unless 
> otherwise specified) are available under the a Creative Commons Attribution 
> 3.0 Unported License. 
> </p>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
> 
> Also tried to load using different errors bits, the same result:
> DB.DBA.TTLP_MT (file_to_string_output 
> ('/workingDir/btc2014_unzipped/01/data.nq-9'), '', 'http://fake.org', 512)
> 
> Why Virtuoso tries to check HTML/XML tags consistency inside the literals?! 
> Is it possible to turn it off? I have too many errors in the dataset, it is a 
> waste of time trying to find all lines with errors and remove them by hands. 
> Can't find anything related to this in the documentation.


I thought I had spotted a parsing error on our end, but on closer examination 
that was not the case.

The issue here is that this value is tagged as a 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>, which triggers 
Virtuoso to actually parse the XML inside the object.

Unfortunately, it appears that either there was a problem with a lot of the 
pages they crawled for this BTC 2014 dataset, or they cut out part of the page. 
In any case, I examined a number of lines that failed, and all of them had 
artifacts that keep the strings from being valid XML.
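
To illustrate the kind of failure involved, here is a minimal Python sketch 
that checks whether a literal's text is well-formed XML. It assumes the string 
has already been unescaped from its N-Quads form, it uses Python's standard 
expat-based parser rather than Virtuoso's own, and the helper name 
is_well_formed is made up for this example, so treat it only as a rough way to 
spot-check lines.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a dummy root."""
    # An XMLLiteral may hold several top-level elements, so wrap it in a
    # single root element before handing it to the parser.
    try:
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError:
        return False
    return True


# Balanced tags parse without error.
print(is_well_formed('<p xmlns="http://www.w3.org/1999/xhtml"><img></img></p>'))
# Same pattern as the reported error: </img> has no matching start tag.
print(is_well_formed('<p xmlns="http://www.w3.org/1999/xhtml"></img></p>'))
```

The first call reports a well-formed fragment and the second does not, which 
matches the "Tag nesting error" in the message quoted above.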

Virtuoso can actually be built with the Tidy library, which we use in our 
Sponger and crawlers to fix up the HTML from pages before parsing it further. 
That prevents this kind of error when the data is later dumped, but it does 
not help you at this point.

One thing you can do is to edit the files and change

        http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral

to 

        http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral
        

That would allow the bulk loader to load these strings without parsing so you 
would get these triples in place. 
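
Doing this by hand is not practical for a dataset of this size, so here is a 
rough Python sketch of the rewrite. The function names, the separate output 
directory and the command-line arguments are assumptions made for this example. 
It only touches the datatype suffix that directly follows a closing quote, so 
the same IRI used as a subject or object is left alone. Writing to a separate 
directory keeps the original files intact and avoids loading both copies with 
the same ld_dir mask.

```python
import glob
import os
import sys

XML_LITERAL = '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>'
PLAIN_LITERAL = '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral>'


def retag_line(line: str) -> str:
    """Swap the XMLLiteral datatype suffix for PlainLiteral on one line."""
    return line.replace(XML_LITERAL, PLAIN_LITERAL)


def retag_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path with retagged literals; return changed lines."""
    changed = 0
    # surrogateescape lets stray non-UTF-8 bytes pass through unchanged.
    with open(src_path, encoding="utf-8", errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as dst:
        for line in src:
            new_line = retag_line(line)
            if new_line != line:
                changed += 1
            dst.write(new_line)
    return changed


def retag_dir(src_mask: str, dst_dir: str) -> None:
    """Rewrite every file matching src_mask into dst_dir under the same name."""
    os.makedirs(dst_dir, exist_ok=True)
    for src_path in sorted(glob.glob(src_mask)):
        dst_path = os.path.join(dst_dir, os.path.basename(src_path))
        print(src_path, "->", dst_path, retag_file(src_path, dst_path))


# Hypothetical usage: python retag.py '<source mask>' <output directory>
if __name__ == "__main__" and len(sys.argv) == 3:
    retag_dir(sys.argv[1], sys.argv[2])
```

After that, ld_dir can be pointed at the output directory instead of the 
original one.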

It might have a small effect on the size of the free text index when using the 
CONTAINS keyword in SPARQL, but other than that it should be ok.


Patrick
---
Patrick van Kleef
Program Manager
OpenLink Software

http://www.openlinksw.com/
http://twitter.com/openlink/


------------------------------------------------------------------------------
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
