good afternoon;
On 2015-04-01, at 16:30, Michael Brunnbauer <[email protected]> wrote:
>
> Hello Andy,
>
> it would just be great to have a mode for tdbloader[2] where invalid
> triples/quads are simply ignored.
somehow that seems like a bad idea.
there are already tools which one could use to that end.
in the case of the core wikidata dataset, rapper (which i do not hereby
elevate to the role of nt conformance arbiter, but anyway) rejects
several thousand statements and can be used to reduce the dataset to those
which are valid.
$ rapper -i ntriples -o ntriples wikidata-statements.nt >
wikidata-statements-clean.nt 2> wikidata-statements-errors.txt
$ ls -l wikidata-statements*
-rw-r--r-- 1 root root 38770855255 Apr 1 16:00 wikidata-statements-clean.nt
-rw-r--r-- 1 root root 1540120 Apr 1 16:00 wikidata-statements-errors.txt
-rw-r--r-- 1 root root 38772450070 Mar 28 08:15 wikidata-statements.nt
$ wc -l wikidata*
233096736 wikidata-statements-clean.nt
9627 wikidata-statements-errors.txt
233106288 wikidata-statements.nt
from which it looks like some of the errors do not cause it to suppress the
statement.
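the arithmetic bears this out (a quick sketch, using the wc -l numbers above):

```shell
# numbers taken from the wc -l output above
total=233106288     # wikidata-statements.nt
clean=233096736     # wikidata-statements-clean.nt
errors=9627         # wikidata-statements-errors.txt
echo "triples dropped: $((total - clean))"   # → 9552
echo "errors reported: $errors"              # → 9627
# 9627 errors reported but only 9552 lines removed: at least 75 reported
# errors left their statement in place (or several errors hit one line)
```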
we would be reluctant to host something in that condition as a service, as one
never knows which relations have been eliminated and how central they might be
to the dataset’s utility.
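as to what the rejected statements look like: the examples quoted further down
in this thread all involve a backslash inside an IRI, which the rdf 1.1 IRIREF
production permits only as part of a \u/\U escape. a rough grep over a made-up
two-line sample (an approximation, not a conforming parser) is enough to count
such lines:

```shell
# sample.nt is a made-up stand-in for the dump: one valid line, one with
# a literal \" inside the object IRI (as in the wikidata examples below)
printf '%s\n' \
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P18v> <http://example.org/ok.jpg> .' \
  '<http://www.wikidata.org/entity/Q2> <http://www.wikidata.org/entity/P18v> <http://example.org/File:G_\"C\"_F.jpg> .' \
  > sample.nt
# IRIREF allows a backslash only in \u/\U escapes, so a bare backslash
# between < and > marks an invalid line (this simple grep would also
# flag legal \uXXXX escapes, which the wikidata dump does not use here)
grep -c '<[^>]*\\[^>]*>' sample.nt   # → 1
```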
best regards, from berlin,
>
> Regards,
>
> Michael Brunnbauer
>
> On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
>> Thanks for that.
>> JENA-911 created.
>>
>> Each of the large public dumps has had quality issues. I'm sure wikidata
>> will fix their process if someone helps them. (Freebase did.)
>>
>> I understand it's frustrating but fixing it in the parser/loader is not a
>> real fix, only a limited workaround, because that data can be passed on
>> to systems which can't cope. That's what standards are for!!
>>
>>
>> (anyone know who is involved?)
>>
>> The RDF 1.1 WG took some time to look at original NT - the <>-grammar rule
>> allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not
>> bad) then even things like \n (which was an NL, not the characters "\" and
>> "n" as the wikidata people are using it) are not getting through. The
>> original NT grammar was specific for test cases and is open and loose by
>> design.
>>
>> Please do feed back to wikidata and we can hope it gets fixed at source.
>>
>> (Ditto DBpedia for that matter)
>>
>> Andy
>>
>> Related: JENA-864
>>
>> NFC and NFKC are two normalization requirements (warnings, not errors) but
>> they seem to be more of a hindrance than a help, so I'm suggesting removing
>> the checking. The IRIs are legal even if not in NFC - just not in the form
>> preferred by W3C.
>>
>> On 01/04/15 14:11, Michael Brunnbauer wrote:
>>>
>>> Hello Andy,
>>>
>>> [tdbloader2 disk access pattern]
>>>> Lots of unique nodes can slow things down because of all the node writing.
>>>
>>> And there is no way to convert this algorithm to sequential access?
>>>
>>> [tdbloader2 parser]
>>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those
>>>>>> in IRIs.
>>>>
>>>> Could you provide a set of data with one feature per N-Triples line,
>>>> marking in a comment what you expect, and I'll check each one and add
>>>> them to the test suite.
>>>
>>> See attachment. I would consider all triples in it illegal according to the
>>> N-Triples spec.
>>>
>>> If I allow these characters that RFC 1738 calls "unsafe", why then not allow
>>> CR, LF and TAB? And why then allow \\ but not \", which seems to be
>>> sanctioned
>>> by older versions of the spec:
>>>
>>> http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>>>
>>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
>>> IRIs, e.g.:
>>>
>>> <http://www.wikidata.org/entity/P1348v>
>>> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
>>> <http://www.wikidata.org/entity/P18v>
>>> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
>>> .
>>>
>>> This trial-and-error cleaning of data dumps with self-made scripts, with
>>> days between each try, is very wearing and probably a big deterrent for
>>> newcomers.
>>> I had it with DBpedia and now I have it with Wikidata all over again (with
>>> new syntax problems).
>>>
>>> Regards,
>>>
>>> Michael Brunnbauer
>>>
>
> --
> ++ Michael Brunnbauer
> ++ netEstate GmbH
> ++ Geisenhausener Straße 11a
> ++ 81379 München
> ++ Tel +49 89 32 19 77 80
> ++ Fax +49 89 32 19 77 89
> ++ E-Mail [email protected]
> ++ http://www.netestate.de/
> ++
> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
> ++ USt-IdNr. DE221033342
> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
---
james anderson | [email protected] | http://dydra.com