good afternoon;

On 2015-04-01, at 16:17, Andy Seaborne <[email protected]> wrote:
> Thanks for that.
> JENA-911 created.
>
> Each of the large public dumps has had quality issues. I'm sure wikidata
> will fix their process if someone helps them. (Freebase did.)
>
> I understand it's frustrating, but fixing it in the parser/loader is not a
> real fix, only a limited workaround, because that data can be passed on
> to systems which can't cope. That's what standards are for!!
>
> (anyone know who is involved?)

for wikidata, our feedback led to #133
(https://github.com/Wikidata/Wikidata-Toolkit/issues/133). we had attempted
to load their core dataset in the hope of working with their temporal data,
and with a thought to hosting the full dataset, but the invalid iri terms
have slowed that endeavour down.

> The RDF 1.1 WG took some time to look at the original NT grammar - the
> <>-rule allows junk IRIs, and, even if you assume some IRI parsing
> (java.net.URI is not bad), things like \n (which was an NL, not the
> characters "\" and "n" as the wikidata people are using it) are not
> getting caught. The original NT grammar was specific to test cases and is
> open and loose by design.
>
> Please do feed back to wikidata and we can hope it gets fixed at source.

see above.

> (Ditto DBpedia for that matter)
>
>         Andy
>
> Related: JENA-864
>
> NFC and NFKC are two normalization requirements (warnings, not errors), but
> they seem to be more of a hindrance than a help, so I'm suggesting removing
> the checking. The IRIs are legal even if not in NFC - just not in the form
> preferred by W3C.
>
> On 01/04/15 14:11, Michael Brunnbauer wrote:
>>
>> Hello Andy,
>>
>> [tdbloader2 disk access pattern]
>>> Lots of unique nodes can slow things down because of all the node writing.
>>
>> And there is no way to convert this algorithm to sequential access?
>>
>> [tdbloader2 parser]
>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in
>>>>> IRIs.
>>>
>>> Could you provide a set of data with one feature per N-Triples line,
>>> marking in a comment what you expect, and I'll check each one and add
>>> them to the test suite.
>>
>> See attachment. I would consider all triples in it illegal according to
>> the N-Triples spec.
>>
>> If I allow these characters that RFC 1738 calls "unsafe", why then not
>> allow CR, LF and TAB? And why then allow \\ but not \", which seems to be
>> sanctioned by older versions of the spec:
>>
>> http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>>
>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with
>> \n IRIs, e.g.:
>>
>> <http://www.wikidata.org/entity/P1348v>
>> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
>> <http://www.wikidata.org/entity/P18v>
>> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
>> .
>>
>> This trial-and-error cleaning of data dumps with self-made scripts, and
>> days between each try, is very straining and probably a big deterrent for
>> newcomers. I had it with DBpedia and now I have it with Wikidata all over
>> again (with new syntax problems).
>>
>> Regards,
>>
>> Michael Brunnbauer
>>

---
james anderson | [email protected] | http://dydra.com
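[editorial note] The check the thread is arguing about can be sketched in a few lines. The N-Triples 1.1 grammar forbids, inside an IRIREF, the control characters U+0000..U+0020 and the characters < > " { } | ^ ` \ (backslash being legal only as part of a \uXXXX or \UXXXXXXXX escape). This is a minimal, hypothetical Python sketch of that filter — `invalid_iris` is an illustrative name, not part of Jena or any tool mentioned above — useful for pre-screening a dump before feeding it to a loader:

```python
import re

# Characters the N-Triples 1.1 IRIREF production disallows:
# control chars U+0000..U+0020 and  < > " { } | ^ `
# plus a bare backslash not starting a \uXXXX / \UXXXXXXXX escape.
BAD = re.compile(r'[\x00-\x20<>"{}|^`]'
                 r'|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')

def invalid_iris(ntriples_line):
    """Return the IRI bodies on one N-Triples line that violate the grammar."""
    return [body for body in re.findall(r'<([^>]*)>', ntriples_line)
            if BAD.search(body)]

# The first Wikidata example quoted above (the dump contains the two
# characters "\" and "n", not a newline) trips the check:
line = ('<http://www.wikidata.org/entity/P1348v> '
        '<http://www.algaebase.org/search/species/detail/?species_id=26717\\n> .')
print(invalid_iris(line))
# → ['http://www.algaebase.org/search/species/detail/?species_id=26717\\n']
```

The same pattern catches the \" case, since the unescaped double quote is itself in the forbidden set. Note this is a line-oriented heuristic, not a full N-Triples parser: it assumes IRIs never contain `>` (the grammar guarantees that) and ignores string literals.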
