good afternoon;

On 2015-04-01, at 16:17, Andy Seaborne <[email protected]> wrote:
> Thanks for that.
> JENA-911 created.
>
> Each of the large public dumps has had quality issues. I'm sure wikidata
> will fix their process if someone helps them. (Freebase did.)
>
> I understand it's frustrating, but fixing it in the parser/loader is not a
> real fix, only a limited workaround, because that data can be passed on
> to systems which can't cope. That's what standards are for!!
>
> (anyone know who is involved?)

for wikidata, our feedback led to #133
(https://github.com/Wikidata/Wikidata-Toolkit/issues/133). we had attempted
to load their core dataset in the hope of working with their temporal data,
and with a thought to hosting the full dataset, but the invalid iri terms
have slowed that endeavour down.

> The RDF 1.1 WG took some time to look at the original NT grammar - the
> <>-rule allows junk IRIs, and, even if you assume some IRI parsing
> (java.net.URI is not bad), things like \n (which was an NL, not the
> characters "\" and "n" as the wikidata people are using it) are not
> getting caught. The original NT grammar was specific to test cases and is
> open and loose by design.
>
> Please do feed back to wikidata and we can hope it gets fixed at source.

see above.

> (Ditto DBpedia for that matter)
>
>         Andy
>
> Related: JENA-864
>
> NFC and NFKC are two normalization requirements (warnings, not errors), but
> they seem to be more of a hindrance than a help, so I'm suggesting removing
> the checking. The IRIs are legal even if not in NFC - just not in the form
> preferred by W3C.
>
> On 01/04/15 14:11, Michael Brunnbauer wrote:
>>
>> Hello Andy,
>>
>> [tdbloader2 disk access pattern]
>>> Lots of unique nodes can slow things down because of all the node writing.
>>
>> And there is no way to convert this algorithm to sequential access?
>>
>> [tdbloader2 parser]
>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in
>>>>> IRIs.
>>>
>>> Could you provide a set of data with one feature per N-Triples line,
>>> marking in a comment what you expect, and I'll check each one and add
>>> them to the test suite.
>>
>> See attachment. I would consider all triples in it illegal according to
>> the N-Triples spec.
>>
>> If I allow these characters that RFC 1738 calls "unsafe", why then not
>> allow CR, LF and TAB? And why then allow \\ but not \", which seems to be
>> sanctioned by older versions of the spec:
>>
>> http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>>
>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with
>> \n IRIs, e.g.:
>>
>> <http://www.wikidata.org/entity/P1348v>
>> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
>> <http://www.wikidata.org/entity/P18v>
>> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
>> .
>>
>> This trial-and-error cleaning of data dumps with self-made scripts, and
>> days between each try, is very straining and probably a big deterrent for
>> newcomers. I had it with DBpedia and now I have it with Wikidata all over
>> again (with new syntax problems).
>>
>> Regards,
>>
>> Michael Brunnbauer
>>

---
james anderson | [email protected] | http://dydra.com
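[editorial note] The check the thread is arguing about can be sketched in a few lines. The N-Triples 1.1 grammar forbids, inside an IRIREF, the control characters U+0000..U+0020 and the characters < > " { } | ^ ` \ (backslash being legal only as part of a \uXXXX or \UXXXXXXXX escape). This is a minimal, hypothetical Python sketch of that filter — `invalid_iris` is an illustrative name, not part of Jena or any tool mentioned above — useful for pre-screening a dump before feeding it to a loader:

```python
import re

# Characters the N-Triples 1.1 IRIREF production disallows:
# control chars U+0000..U+0020 and  < > " { } | ^ `
# plus a bare backslash not starting a \uXXXX / \UXXXXXXXX escape.
BAD = re.compile(r'[\x00-\x20<>"{}|^`]'
                 r'|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')

def invalid_iris(ntriples_line):
    """Return the IRI bodies on one N-Triples line that violate the grammar."""
    return [body for body in re.findall(r'<([^>]*)>', ntriples_line)
            if BAD.search(body)]

# The first Wikidata example quoted above (the dump contains the two
# characters "\" and "n", not a newline) trips the check:
line = ('<http://www.wikidata.org/entity/P1348v> '
        '<http://www.algaebase.org/search/species/detail/?species_id=26717\\n> .')
print(invalid_iris(line))
# → ['http://www.algaebase.org/search/species/detail/?species_id=26717\\n']
```

The same pattern catches the \" case, since the unescaped double quote is itself in the forbidden set. Note this is a line-oriented heuristic, not a full N-Triples parser: it assumes IRIs never contain `>` (the grammar guarantees that) and ignores string literals.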
