Could somebody help me understand how to deal with this error while importing the data?

/btc2014_unzipped/01/data.nq-10 http://fake-latest.org 2 2015.9.22 23:10.20 322216000 2015.9.22 23:10.38 888367000 0 NULL 42000 RDFGE: RDF box with a geometry RDF type and a non-geometry content

There is no clue as to which particular lines cause the error, so I am stuck and cannot remove or change them. Alternatively, how can I load the data while skipping the lines that contain errors?

Thank you.

On 23 September 2015 at 16:12, Roman Sokolov <ole...@gmail.com> wrote:
> Thanks a lot for your help, Patrick!
> Yes, my mistake, it is the BTC dataset, not DBpedia.
> I changed the literal types from XML to Plain and the errors disappeared.
>
> But now I get a new error:
> /btc2014_unzipped/01/data.nq-10 http://fake-latest.org 2 2015.9.22 23:10.20 322216000 2015.9.22 23:10.38 888367000 0 NULL 42000 RDFGE: RDF box with a geometry RDF type and a non-geometry content
>
> This error is quite frequent in the dataset, and I guess it is related to geo-data. But the problem is that, in contrast to the previous error, I cannot see the details or the line where the error occurred, so I cannot check in the dataset which line caused it. Strange that there are no details...
>
> Thank you.
>
> On 18 September 2015 at 13:42, Patrick van Kleef <pkl...@openlinksw.com> wrote:
>
>> Hi Roman,
>>
>> > Hello.
>> > I have a lot of errors when I want to load the DBpedia dataset using isql. The command:
>> > ld_dir('/workingDir/btc2014_unzipped/01', 'data.nq-*', 'http://fake.org');
>> >
>> > Example error:
>> >
>> > 22007 XM003: XML parser detected an error: ERROR : Tag nesting
>> > error: name 'img' of end tag does not match the name 'p' of start tag
>> > at line 4 column 432 at line 4 column 438 of source text
>> > 04/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"></img></p>
>> > ----------------------------------------------------------------------^
>> >
>> > OK, let's find the line where the error occurred (I put in a line break, so it is easier to see):
>> >
>> > <http://core-project.kmi.open.ac.uk/data-description> <http://purl.org/rss/1.0/modules/content/encoded> "<h2 xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">What data are exposed</h2>\n<p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">The CORE project exposes data about the aggregated content. The following schema shows the kind of metadata CORE holds about each resource. </p>\n<h2 xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data Schema</h2>\n<p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"></img></p>
>> > \n<h2 xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data License</h2>\n<p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">All data from CORE (unless otherwise specified) are available under the a Creative Commons Attribution 3.0 Unported License. </p>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
>> >
>> > I also tried to load using different error bits, with the same result:
>> > DB.DBA.TTLP_MT (file_to_string_output ('/workingDir/btc2014_unzipped/01/data.nq-9'), '', 'http://fake.org', 512)
>> >
>> > Why does Virtuoso try to check HTML/XML tag consistency inside the literals?! Is it possible to turn it off? I have too many errors in the dataset; it is a waste of time trying to find all the lines with errors and remove them by hand. I can't find anything related to this in the documentation.
>>
>>
>> I thought I spotted a parsing error on our end, but on closer examination this was not the case.
>>
>> The issue here is that this value is tagged as a <http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>, which triggers Virtuoso to actually parse the XML inside the object.
>>
>> Unfortunately it appears there was either a problem with a lot of pages they crawled for this BTC 2014 dataset, or they cut out part of the page. In any case I examined a number of lines that failed and all had issues with artifacts causing the strings not to be valid XML.
>>
>> Virtuoso actually can be built with the Tidy library, which we use in our Sponger and crawlers to fix up the HTML from pages before further parsing it, to make sure these kinds of errors do not occur when then dumping the data, but that does not help you at this point.
>>
>> One thing you can do is to edit the files and change
>>
>> http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
>>
>> to
>>
>> http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral
>>
>>
>> That would allow the bulk loader to load these strings without parsing so you would get these triples in place.
>>
>> It might have a small effect on the size of the free text index when using the CONTAINS keyword in SPARQL, but other than that it should be OK.
>>
>>
>> Patrick
>> ---
>> Patrick van Kleef
>> Program Manager
>> OpenLink Software
>>
>> http://www.openlinksw.com/
>> http://twitter.com/openlink/
>>
>
> --
> Best regards, Roman Sokolov

--
Best regards, Roman Sokolov
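
For reference, the file-level workarounds discussed in this thread can be sketched in Python. This is only an illustration under stated assumptions, not something taken from the messages above: `rewrite_datatype` does the XMLLiteral-to-PlainLiteral substitution Patrick describes, and `find_datatype_lines` lists the line numbers of statements carrying a given datatype IRI, which may help narrow down candidates for the RDFGE error. The function names and file paths are hypothetical, and the idea that the offending literals carry a geometry datatype (for example virtrdf:Geometry) is a guess that should be checked against the data.

```python
# Hypothetical helpers; plain string matching, not a full N-Quads parser.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LITERAL = "^^<" + RDF_NS + "XMLLiteral>"
PLAIN_LITERAL = "^^<" + RDF_NS + "PlainLiteral>"

# "surrogateescape" lets bytes that are not valid UTF-8 pass through
# unchanged, and newline="" keeps the original line endings.
_OPEN_ARGS = dict(encoding="utf-8", errors="surrogateescape", newline="")


def rewrite_datatype(src_path, dst_path, old=XML_LITERAL, new=PLAIN_LITERAL):
    """Copy a file line by line, replacing one datatype suffix with another.

    Returns the number of lines that were changed.
    """
    changed = 0
    with open(src_path, **_OPEN_ARGS) as src, \
         open(dst_path, "w", **_OPEN_ARGS) as dst:
        for line in src:
            fixed = line.replace(old, new)
            if fixed != line:
                changed += 1
            dst.write(fixed)
    return changed


def find_datatype_lines(path, datatype_iri):
    """Return 1-based numbers of lines containing ^^<datatype_iri>."""
    marker = "^^<" + datatype_iri + ">"
    with open(path, **_OPEN_ARGS) as src:
        return [n for n, line in enumerate(src, start=1) if marker in line]
```

A call such as `rewrite_datatype("data.nq-9", "data.nq-9.fixed")` would write a rewritten copy and leave the original untouched; `find_datatype_lines("data.nq-10", "http://www.openlinksw.com/schemas/virtrdf#Geometry")` would list candidate lines if that assumed datatype is indeed the one involved.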