Re: [Virtuoso-users] Virtuoso DBpedia load - parsing errors

Roman Sokolov Wed, 23 Sep 2015 07:16:46 -0700

Thanks a lot for your help, Patrick!
Yes, my mistake, it is the BTC dataset, not DBpedia.
I changed the literal types from XML to Plain and the errors disappeared.
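
For anyone hitting the same thing, here is a minimal sketch of how that retagging could be scripted instead of done by hand. The function names are made up for illustration, and it assumes the files are line-oriented N-Quads in which the datatype always appears as a `"^^<...>` suffix right after the closing quote of the literal. It only touches that suffix, so the XMLLiteral IRI used as a plain subject or object is left alone.

```python
XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
PLAIN_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral"

# Only the datatype suffix of a literal is rewritten, not every mention of the IRI.
OLD_SUFFIX = '"^^<' + XML_LITERAL + '>'
NEW_SUFFIX = '"^^<' + PLAIN_LITERAL + '>'


def retag_line(line):
    """Return the line with XMLLiteral datatype suffixes changed to PlainLiteral."""
    return line.replace(OLD_SUFFIX, NEW_SUFFIX)


def retag_file(src_path, dst_path):
    """Copy src_path to dst_path, retagging literals; return the number of changed lines."""
    changed = 0
    # surrogateescape lets badly encoded bytes in crawled data pass through unchanged;
    # newline="" keeps the original line endings.
    with open(src_path, encoding="utf-8", errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape", newline="") as dst:
        for line in src:
            new_line = retag_line(line)
            if new_line != line:
                changed += 1
            dst.write(new_line)
    return changed
```

Writing to a separate output file keeps the original dump intact, so the load can be repeated against either version.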


But now I get a new error:
/btc2014_unzipped/01/data.nq-10
http://fake-latest.org
       2           2015.9.22 23:10.20 322216000  2015.9.22 23:10.38 888367000
       0           NULL
42000 RDFGE: RDF box with a geometry RDF type and a non-geometry content

This error is quite frequent in the dataset, and I guess it is related to
geo-data. The problem is that, in contrast to the previous error, I cannot
see the details or the line where the error occurred, so I cannot check
which line in the dataset caused it. It is strange that there are no details...
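
Since the loader does not report the line, one possible way to narrow it down is to scan a file for literals that carry a geometry datatype but whose content does not look like a geometry. This is only a sketch under assumptions: I am guessing that the relevant datatypes are the two IRIs listed below and that valid content starts with a WKT keyword such as POINT; the function name and both patterns are mine, not anything taken from Virtuoso.

```python
import re

# Assumed geometry datatypes; adjust if the dataset uses others.
GEO_DATATYPES = (
    "http://www.openlinksw.com/schemas/virtrdf#Geometry",
    "http://www.opengis.net/ont/geosparql#wktLiteral",
)

# A quoted literal (backslash escapes allowed) followed by ^^<datatype>.
LITERAL_RE = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]+)>')

# Rough check that the content starts with a WKT geometry keyword.
WKT_RE = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|"
    r"MULTIPOLYGON|GEOMETRYCOLLECTION)\b",
    re.IGNORECASE,
)


def find_suspect_geo_literals(lines):
    """Yield (line_number, content) for geometry-typed literals that do not look like WKT."""
    for number, line in enumerate(lines, start=1):
        for match in LITERAL_RE.finditer(line):
            content, datatype = match.group(1), match.group(2)
            if datatype in GEO_DATATYPES and not WKT_RE.match(content):
                yield number, content
```

It could be run over an open file object, for example `for n, text in find_suspect_geo_literals(open(path, encoding="utf-8", errors="replace")): print(n, text)`, where `path` is whichever file was reported as failing.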

Thank you.

On 18 September 2015 at 13:42, Patrick van Kleef <pkl...@openlinksw.com>
wrote:

> Hi Roman,
>
> > Hello.
> > I have a lot of errors when I want to load DBpedia dataset using isql,
> > the command:
> > ld_dir('/workingDir/btc2014_unzipped/01', 'data.nq-*', 'http://fake.org');
> >
> > Example error:
> >
> >  22007 XM003: XML parser detected an error:     ERROR  : Tag nesting
> >  error: name 'img' of end tag does not match the name 'p' of start tag
> >  at line 4 column 432 at line 4 column 438 of source text
> >  04/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#
> "></img></p>
> >  ----------------------------------------------------------------------^
> >
> > Ok, let's find the line where the error occurred (I put a line break, so
> > it is easier to see):
> >
> > <http://core-project.kmi.open.ac.uk/data-description> <
> http://purl.org/rss/1.0/modules/content/encoded> "<h2 xmlns=\"
> http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\">What data are exposed</h2>\n<p
> xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\">The CORE project exposes data about
> the aggregated content. The following schema shows the kind of metadata
> CORE holds about each resource. </p>\n<h2 xmlns=\"
> http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\">Data Schema</h2>\n<p xmlns=\"
> http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\"></img></p>
> >     \n<h2 xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\">Data License</h2>\n<p xmlns=\"
> http://www.w3.org/1999/xhtml\" xmlns:content=\"
> http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"
> http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
> xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"
> http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"
> http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
> xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"
> http://www.w3.org/2001/XMLSchema#\">All data from CORE (unless otherwise
> specified) are available under the a Creative Commons Attribution 3.0
> Unported License. </p>\n"^^<
> http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
> >
> > Also tried to load using different error bits, the same result:
> > DB.DBA.TTLP_MT (file_to_string_output
> >   ('/workingDir/btc2014_unzipped/01/data.nq-9'), '', 'http://fake.org', 512)
> >
> > Why does Virtuoso try to check HTML/XML tag consistency inside the
> > literals?! Is it possible to turn it off? I have too many errors in the
> > dataset; it is a waste of time trying to find all lines with errors and
> > remove them by hand. I can't find anything related to this in the
> > documentation.
>
>
> I thought I spotted a parsing error on our end, but on closer examination
> this was not the case.
>
> The issue here is that this value is tagged as a
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>, which triggers
> Virtuoso to actually parse the XML inside the object.
>
> Unfortunately it appears there was either a problem with a lot of the pages
> they crawled for this BTC 2014 dataset, or they cut out part of the page.
> In any case I examined a number of lines that failed, and all had
> artifacts that made the strings invalid XML.
>
> Virtuoso can actually be built with the Tidy library, which we use in our
> Sponger and crawlers to fix up the HTML from pages before parsing it
> further, so that these kinds of errors do not occur when the data is later
> dumped, but that does not help you at this point.
>
> One thing you can do is to edit the files and change
>
>         http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
>
> to
>
>         http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral
>
>
> That would allow the bulk loader to load these strings without parsing so
> you would get these triples in place.
>
> It might have a small effect on the size of the free text index when using
> the CONTAINS keyword in SPARQL, but other than that it should be ok.
>
>
> Patrick
> ---
> Patrick van Kleef
> Program Manager
> OpenLink Software
>
> http://www.openlinksw.com/
> http://twitter.com/openlink/
>
>


-- 
Best regards, Roman Sokolov
