Re: [Virtuoso-users] Virtuoso DBpedia load - parsing errors

Roman Sokolov Wed, 30 Sep 2015 09:40:53 -0700

Could somebody help me understand how to deal with this error while
importing the data?
/btc2014_unzipped/01/data.nq-10
http://fake-latest.org
       2           2015.9.22 23:10.20 322216000  2015.9.22 23:10.38
888367000  0           NULL        42000 RDFGE: RDF box with a geometry RDF
type and a non-geometry content


There is no clue as to which particular lines cause the error, so I am stuck
and cannot remove or change them.
Alternatively, how can I load the data while skipping the lines that contain
errors?
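
One way to narrow down the offending lines before loading is to scan the N-Quads file for literals that carry a geometry datatype but whose content does not look like WKT. The following is only a minimal sketch: the two datatype IRIs in GEOMETRY_DATATYPES and the rough WKT pattern are assumptions about what triggers RDFGE, not something confirmed in this thread, and the helper name suspicious_lines is made up for illustration.

```python
import re

# ASSUMPTION: literals with one of these datatypes are treated as geometries.
# Adjust the list to whatever datatype IRIs actually occur in your files.
GEOMETRY_DATATYPES = (
    "http://www.openlinksw.com/schemas/virtrdf#Geometry",
    "http://www.opengis.net/ont/geosparql#wktLiteral",
)

# A typed literal in N-Triples/N-Quads syntax: "lexical form"^^<datatype IRI>
TYPED_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]*)>')

# Rough WKT check (an approximation, not a full WKT parser): an optional
# <srs-iri> prefix, a geometry keyword, then "(" or EMPTY.
WKT_START = re.compile(
    r"^\s*(?:<[^>]*>\s*)?"
    r"(?:POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON"
    r"|GEOMETRYCOLLECTION|BOX|BOX2D)\s*(?:ZM|Z|M)?\s*(?:\(|EMPTY\b)",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Yield (line_number, line) for geometry-typed literals that do not look like WKT."""
    for number, line in enumerate(lines, start=1):
        for match in TYPED_LITERAL.finditer(line):
            lexical, datatype = match.groups()
            if datatype in GEOMETRY_DATATYPES and not WKT_START.match(lexical):
                yield number, line.rstrip("\n")
                break  # report each line once


# Usage (path taken from the thread):
# with open("/btc2014_unzipped/01/data.nq-10", encoding="utf-8", errors="replace") as f:
#     for number, line in suspicious_lines(f):
#         print(number, line)
```

The reported lines could then be inspected, fixed, or filtered out into a separate file before running the bulk loader again.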

Thank you.


On 23 September 2015 at 16:12, Roman Sokolov <ole...@gmail.com> wrote:

> Thanks a lot for your help, Patrick!
> Yes, my mistake, it is BTC dataset, not DBpedia.
> I changed the literal types from XML to Plain and the errors disappeared.
>
> But now I get a new error:
> /btc2014_unzipped/01/data.nq-10
> http://fake-latest.org
>          2           2015.9.22 23:10.20 322216000  2015.9.22 23:10.38
> 888367000  0           NULL        42000 RDFGE: RDF box with a geometry RDF
> type and a non-geometry content
>
> This error is quite frequent in the dataset, and I guess it is related to
> geo-data. But the problem is that, in contrast to the previous error, I
> cannot see the details or the line where the error occurred, so I cannot
> check which line in the dataset caused it. Strange that there are no
> details...
>
> Thank you.
>
> On 18 September 2015 at 13:42, Patrick van Kleef <pkl...@openlinksw.com>
> wrote:
>
>> Hi Roman,
>>
>> > Hello.
>> > I get a lot of errors when I try to load the DBpedia dataset using isql
>> > with the command:
>> > ld_dir('/workingDir/btc2014_unzipped/01', 'data.nq-*', 'http://fake.org');
>> >
>> > Example error:
>> >
>> >  22007 XM003: XML parser detected an error:     ERROR  : Tag nesting
>> >  error: name 'img' of end tag does not match the name 'p' of start tag
>> >  at line 4 column 432 at line 4 column 438 of source text
>> >  04/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"></img></p>
>> >  ----------------------------------------------------------------------^
>> >
>> > OK, let's find the line where the error occurred (I added a line break so
>> > it is easier to see):
>> >
>> > <http://core-project.kmi.open.ac.uk/data-description>
>> > <http://purl.org/rss/1.0/modules/content/encoded>
>> > "<h2 xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">What data are exposed</h2>\n<p
>> > xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">The CORE project exposes data about
>> > the aggregated content. The following schema shows the kind of metadata
>> > CORE holds about each resource. </p>\n<h2
>> > xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data Schema</h2>\n<p
>> > xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"></img></p>
>> >     \n<h2 xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Data License</h2>\n<p
>> > xmlns=\"http://www.w3.org/1999/xhtml\"
>> > xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"
>> > xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"
>> > xmlns:og=\"http://ogp.me/ns#\"
>> > xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"
>> > xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\"
>> > xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\"
>> > xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">All data from CORE (unless otherwise
>> > specified) are available under the a Creative Commons Attribution 3.0
>> > Unported License. </p>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
>> >
>> > I also tried loading with different error-handling flag bits, with the
>> > same result:
>> > DB.DBA.TTLP_MT (file_to_string_output
>> > ('/workingDir/btc2014_unzipped/01/data.nq-9'), '', 'http://fake.org', 512)
>> >
>> > Why does Virtuoso try to check HTML/XML tag consistency inside the
>> > literals? Is it possible to turn it off? I have too many errors in the
>> > dataset; it is a waste of time trying to find all lines with errors and
>> > remove them by hand. I can't find anything related to this in the
>> > documentation.
>>
>>
>> I thought I had spotted a parsing error on our end, but on closer
>> examination this was not the case.
>>
>> The issue here is that this value is tagged as a
>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>, which triggers
>> Virtuoso to actually parse the XML inside the object.
>>
>> Unfortunately, it appears that either there was a problem with a lot of
>> the pages crawled for this BTC 2014 dataset, or part of each page was cut
>> out. In any case, I examined a number of lines that failed, and all of
>> them had artifacts that made the strings invalid XML.
>>
>> Virtuoso can actually be built with the Tidy library, which we use in our
>> Sponger and crawlers to fix up the HTML from pages before parsing it
>> further, so that these kinds of errors do not occur when the data is later
>> dumped, but that does not help you at this point.
>>
>> One thing you can do is to edit the files and change
>>
>>         http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
>>
>> to
>>
>>         http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral
>>
>>
>> That would allow the bulk loader to load these strings without parsing so
>> you would get these triples in place.
>>
>> It might have a small effect on the size of the free text index when
>> using the CONTAINS keyword in SPARQL, but other than that it should be ok.
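
The datatype change described above can be applied to a whole file without editing it by hand. This is only a rough sketch of one way to do it: the function names retag_line and retag_file are made up for illustration, and it assumes every XMLLiteral in the file should be retagged, including the ones that happen to be valid XML.

```python
# The two datatype suffixes named in the suggestion above, as they appear
# directly after the closing quote of a literal in N-Triples/N-Quads syntax.
XML_LITERAL = '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>'
PLAIN_LITERAL = '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral>'


def retag_line(line):
    """Return the line with every XMLLiteral datatype tag replaced by PlainLiteral."""
    return line.replace(XML_LITERAL, PLAIN_LITERAL)


def retag_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, retagging XMLLiteral literals."""
    # surrogateescape round-trips bytes that are not valid UTF-8 unchanged;
    # newline="" keeps the original line endings.
    with open(src_path, encoding="utf-8", errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape", newline="") as dst:
        for line in src:
            dst.write(retag_line(line))


# Usage (hypothetical output path):
# retag_file("/workingDir/btc2014_unzipped/01/data.nq-9",
#            "/workingDir/btc2014_retagged/01/data.nq-9")
```

Because this is a plain string replacement, it would also rewrite the same character sequence if it appeared escaped inside a literal's text, which is unlikely but worth keeping in mind.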
>>
>>
>> Patrick
>> ---
>> Patrick van Kleef
>> Program Manager
>> OpenLink Software
>>
>> http://www.openlinksw.com/
>> http://twitter.com/openlink/
>>
>>
>
>
> --
> Best regards, Roman Sokolov
>
>
>


-- 
Best regards, Roman Sokolov

------------------------------------------------------------------------------

_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
