Hi Roman,

> Hello.
> I have a lot of errors when I want to load DBpedia dataset using isql, the
> command:
> ld_dir('/workingDir/btc2014_unzipped/01', 'data.nq-*', 'http://fake.org');
>
> Example error:
>
> 22007 XM003: XML parser detected an error: ERROR : Tag nesting
> error: name 'img' of end tag does not match the name 'p' of start tag
> at line 4 column 432 at line 4 column 438 of source text
> 04/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"></img></p>
> ----------------------------------------------------------------------^
>
> Ok, let's find the line where the error occured (I put a line break, so it is
> easier to see):
>
> <http://core-project.kmi.open.ac.uk/data-description>
> <http://purl.org/rss/1.0/modules/content/encoded> "<h2
> xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">What data are
> exposed</h2>\n<p xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">The CORE project exposes
> data about the aggregated content. The following schema shows the kind of
> metadata CORE holds about each resource.
> </p>\n<h2 xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data Schema</h2>\n<p
> xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"></img></p>
> \n<h2 xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data License</h2>\n<p
> xmlns=\"http://www.w3.org/1999/xhtml\"
> xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
> xmlns:dc=\"http://purl.org/dc/terms/\"
> xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\"
> xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
> xmlns:sioc=\"http://rdfs.org/sioc/ns#\"
> xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
> xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">All data from CORE (unless
> otherwise specified) are available under the a Creative Commons
> Attribution 3.0 Unported License. </p>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
>
> Also tried to load using different errors bits, the same result:
> DB.DBA.TTLP_MT (file_to_string_output
> ('/workingDir/btc2014_unzipped/01/data.nq-9'), '', 'http://fake.org', 512)
>
> Why Virtuoso tries to check HTML/XML tags consistency inside the literals?!
> Is it possible to turn it off? I have too many errors in the dataset, it is a
> waste of time trying to find all lines with errors and remove them by hands.
> Can't find anything related to this in the documentation.

I thought I had spotted a parsing error on our end, but on closer examination this was not the case.

The issue here is that this value is typed as <http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>, which triggers Virtuoso to actually parse the XML inside the object. Unfortunately it appears that either a lot of the pages crawled for this BTC 2014 dataset had problems, or part of each page was cut out. In any case, I examined a number of the lines that failed, and all of them contained artifacts that keep the strings from being valid XML.

Virtuoso can be built with the Tidy library, which we use in our Sponger and crawlers to fix up the HTML from pages before parsing it further, so that these kinds of errors do not occur when the data is later dumped. That does not help you at this point, though.

One thing you can do is edit the files and change

    http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral

to

    http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral

That would allow the bulk loader to load these strings without parsing them, so you would get these triples in place. It might have a small effect on the size of the free-text index when using the CONTAINS keyword in SPARQL, but other than that it should be OK.

Patrick
---
Patrick van Kleef
Program Manager
OpenLink Software
http://www.openlinksw.com/
http://twitter.com/openlink/
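
For reference, the datatype substitution described above could be scripted rather than done by hand. The following is only a minimal sketch in Python, not something taken from the Virtuoso tooling: the function names (retype_line, retype_file, retype_dir) and the idea of writing the rewritten files to a separate output directory are assumptions made for illustration. It works on raw bytes so that odd encodings in the crawl data cannot break the rewrite, and it only touches the `"^^<...#XMLLiteral>` suffix that follows a literal's closing quote.

```python
import glob
import os

# Datatype suffixes as they appear right after a literal's closing quote.
XML_LITERAL = b'"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>'
PLAIN_LITERAL = b'"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral>'


def retype_line(line: bytes) -> bytes:
    """Return the line with every XMLLiteral datatype suffix swapped for PlainLiteral."""
    return line.replace(XML_LITERAL, PLAIN_LITERAL)


def retype_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path with the datatype swapped; return the number of changed lines."""
    changed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            new_line = retype_line(line)
            if new_line != line:
                changed += 1
            dst.write(new_line)
    return changed


def retype_dir(src_dir: str, dst_dir: str, pattern: str = "data.nq-*") -> dict:
    """Rewrite every file in src_dir matching pattern into dst_dir (hypothetical layout).

    Returns a mapping of file name to the number of lines that were changed.
    """
    os.makedirs(dst_dir, exist_ok=True)
    results = {}
    for src_path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        name = os.path.basename(src_path)
        results[name] = retype_file(src_path, os.path.join(dst_dir, name))
    return results
```

Calling retype_dir('/workingDir/btc2014_unzipped/01', some_output_dir), where some_output_dir is a directory of your choosing, would leave the original files untouched, and ld_dir could then be pointed at the output directory instead. A plain byte replacement like this could in principle also hit the same character sequence inside a literal's text, so it may be worth spot-checking the changed-line counts it returns.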