good afternoon;
On 2015-04-01, at 16:30, Michael Brunnbauer <[email protected]> wrote:
>
> Hello Andy,
>
> it would just be great to have a mode for tdbloader[2] where invalid
> triples/quads are simply ignored.
somehow that seems like a bad idea.
there are already tools which one could use to that end.
in the case of the core wikidata dataset, rapper (which i do not hereby
elevate to the role of nt conformance arbiter, but anyway) rejects
several thousand statements and can be used to reduce the dataset to those
which are valid.
$ rapper -i ntriples -o ntriples wikidata-statements.nt >
wikidata-statements-clean.nt 2> wikidata-statements-errors.txt
$ ls -l wikidata-statements*
-rw-r--r-- 1 root root 38770855255 Apr 1 16:00 wikidata-statements-clean.nt
-rw-r--r-- 1 root root 1540120 Apr 1 16:00 wikidata-statements-errors.txt
-rw-r--r-- 1 root root 38772450070 Mar 28 08:15 wikidata-statements.nt
$ wc -l wikidata*
233096736 wikidata-statements-clean.nt
9627 wikidata-statements-errors.txt
233106288 wikidata-statements.nt
from which it looks like some of the errors do not cause it to suppress the
statement.
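the arithmetic bears this out (a quick sketch, using the wc -l numbers above):

```shell
# numbers taken from the wc -l output above
total=233106288     # wikidata-statements.nt
clean=233096736     # wikidata-statements-clean.nt
errors=9627         # wikidata-statements-errors.txt
echo "triples dropped: $((total - clean))"   # → 9552
echo "errors reported: $errors"              # → 9627
# 9627 errors reported but only 9552 lines removed: at least 75 reported
# errors left their statement in place (or several errors hit one line)
```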
we would be reluctant to host something in that condition as a service, as one
never knows which relations have been eliminated and how central they might be
to the dataset’s utility.
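as to what the rejected statements look like: the examples quoted further down
in this thread all involve a backslash inside an IRI, which the rdf 1.1 IRIREF
production permits only as part of a \u/\U escape. a rough grep over a made-up
two-line sample (an approximation, not a conforming parser) is enough to count
such lines:

```shell
# sample.nt is a made-up stand-in for the dump: one valid line, one with
# a literal \" inside the object IRI (as in the wikidata examples below)
printf '%s\n' \
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P18v> <http://example.org/ok.jpg> .' \
  '<http://www.wikidata.org/entity/Q2> <http://www.wikidata.org/entity/P18v> <http://example.org/File:G_\"C\"_F.jpg> .' \
  > sample.nt
# IRIREF allows a backslash only in \u/\U escapes, so a bare backslash
# between < and > marks an invalid line (this simple grep would also
# flag legal \uXXXX escapes, which the wikidata dump does not use here)
grep -c '<[^>]*\\[^>]*>' sample.nt   # → 1
```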
best regards, from berlin,
>
> Regards,
>
> Michael Brunnbauer
>
> On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
>> Thanks for that.
>> JENA-911 created.
>>
>> Each of the large public dumps has had quality issues. I'm sure wikidata
>> will fix their process if someone helps them. (Freebase did.)
>>
>> I understand it's frustrating but fixing it in the parser/loader is not a
>> real fix, only a limited workaround, because that data can be passed on
>> to systems which can't cope. That's what standards are for!!
>>
>>
>> (anyone know who is involved?)
>>
>> The RDF 1.1 WG took some time to look at original NT - the <>-grammar rule
>> allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not
>> bad) then even things like \n (which was an NL, not the characters "\" and
>> "n" as the wikidata people are using it) are not getting through. The
>> original NT grammar was specific for test cases and is open and loose by
>> design.
>>
>> Please do feed back to wikidata and we can hope it gets fixed at source.
>>
>> (Ditto DBpedia for that matter)
>>
>> Andy
>>
>> Related: JENA-864
>>
>> NFC and NFKC are two normalization requirements (warnings, not errors) but
>> they seem to be more of a hindrance than a help, so I'm suggesting removing
>> the checking. The IRIs are legal even if not in NFC - just not in the form
>> preferred by W3C.
>>
>> On 01/04/15 14:11, Michael Brunnbauer wrote:
>>>
>>> Hello Andy,
>>>
>>> [tdbloader2 disk access pattern]
>>>> Lots of unique nodes can slow things down because of all the node writing.
>>>
>>> And there is no way to convert this algorithm to sequential access?
>>>
>>> [tdbloader2 parser]
>>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those
>>>>>> in IRIs.
>>>>
>>>> Could you provide a set of data with one feature per N-Triples line,
>>>> marking in a comment what you expect, and I'll check each one and add
>>>> them to the test suite.
>>>
>>> See attachment. I would consider all triples in it illegal according to the
>>> N-Triples spec.
>>>
>>> If I allow these characters that RFC 1738 calls "unsafe", why then not allow
>>> CR, LF and TAB? And why then allow \\ but not \", which seems to be
>>> sanctioned
>>> by older versions of the spec:
>>>
>>> http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>>>
>>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
>>> IRIs, e.g.:
>>>
>>> <http://www.wikidata.org/entity/P1348v>
>>> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
>>> <http://www.wikidata.org/entity/P18v>
>>> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
>>> .
>>>
>>> This trial-and-error cleaning of data dumps with self-made scripts, with
>>> days between each try, is very wearing and probably a big deterrent for
>>> newcomers.
>>> I had it with DBpedia and now I have it with Wikidata all over again (with
>>> new syntax problems).
>>>
>>> Regards,
>>>
>>> Michael Brunnbauer
>>>
>
> --
> ++ Michael Brunnbauer
> ++ netEstate GmbH
> ++ Geisenhausener Straße 11a
> ++ 81379 München
> ++ Tel +49 89 32 19 77 80
> ++ Fax +49 89 32 19 77 89
> ++ E-Mail [email protected]
> ++ http://www.netestate.de/
> ++
> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
> ++ USt-IdNr. DE221033342
> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
---
james anderson | [email protected] | http://dydra.com