I feel strongly in favor of one-line-per-fact.

    Large RDF data sets have validity problems, and the difficulty of 
convincing publishers that this matters suggests that this situation will 
continue.

    I’ve thought a bit about the problem of the “streaming converter from 
Turtle to N-Triples”. It’s true that this can be done in a streaming manner 
most of the time, but Turtle has a stack that can get arbitrarily deep, so you 
can’t say, strictly, that memory consumption is bounded.
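For example (a made-up fragment), nested blank-node property lists can go 
arbitrarily deep, and the parser has to hold every open level on its stack 
before it can emit a single triple:

```turtle
@prefix ex: <http://example.org/> .
ex:a ex:p [ ex:p [ ex:p [ ex:p [ ex:p ex:b ] ] ] ] .
```

Nothing stops a publisher from nesting a hundred thousand levels deep.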

    It’s also very unclear to me how exactly to work around broken records and 
restart the parser in the general case. It’s not hard to mock up examples 
where a simple recovery mechanism works, but I dread the thought of developing 
one for commercial use, where I’d probably be playing whack-a-mole with edge 
cases for years.

    There was a gap of quite a few years in the late ’90s when there weren’t 
usable open-source web browsers, because a practical web browser had to: (1) 
read broken markup, and (2) render it exactly the same as Netscape 3. 
Commercial operations can get things like this done by burning out programmers, 
who finally show up at a standup meeting one day, smash their laptops and 
stomp out. It’s not so easy in the open-source world, where you’re forced to 
use carrots and not sticks.

    As for compression vs. inner format, I also have some thoughts, because 
for every product I’ve made in the last few years I’ve tried a few different 
packaging methods before releasing the final version.

    Gzip eats up a lot of the ‘markup bloat’ in N-Triples because recently 
used IRIs and prefixes will be in the dictionary. The minus is that the 
dictionary isn’t very big, so the contents of the dictionary itself are 
bloated: there isn’t much entropy there, but the same markup bloat gets 
repeated hundreds of times. If you just put the prefixes in a hash table, that 
might be more like 1,000 bytes total to represent them. When you 
prefix-compress RDF and then gzip it, you’ve got the advantage that the 
dictionary contains more entropy than it otherwise would. Even though gzip is 
no longer cutting out as much markup bloat, it is compressing against a better 
model of the document, so you get better results.
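A toy sketch of the effect, using made-up BaseKB-style IRIs (the exact sizes 
are illustrative, not measurements from real data):

```python
import gzip

# Made-up, highly repetitive N-Triples: every line repeats the same long IRI stem.
stem = "http://rdf.basekb.com/ns/"
ntriples = "".join(
    f"<{stem}m.{i:x}> <{stem}type> <{stem}thing> .\n" for i in range(10_000)
)

# The same facts with the stem factored into a prefix, Turtle-style.
turtle = f"@prefix ns: <{stem}> .\n" + "".join(
    f"ns:m.{i:x} ns:type ns:thing .\n" for i in range(10_000)
)

raw, raw_gz = len(ntriples), len(gzip.compress(ntriples.encode()))
pre, pre_gz = len(turtle), len(gzip.compress(turtle.encode()))
print(f"N-Triples: {raw} bytes -> {raw_gz} gzipped")
print(f"prefixed:  {pre} bytes -> {pre_gz} gzipped")
```

Gzip flattens most of the repeated markup either way; the prefixed input is 
the one where the compressor spends its dictionary on actual content.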

    As has been pointed out, sorting helps. If you sort in ?s ?p ?o . order, 
part of the benefit is that the sorting itself removes entropy (there are N! 
possible unsorted files and only one sorted one), and obviously the dictionary 
is being set up to roll together common ?s and ?s ?p prefixes the way Turtle 
does.
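A quick sketch of that with synthetic triples and hypothetical IRIs: gzip the 
same facts shuffled and sorted, and the sorted file comes out smaller:

```python
import gzip
import random

# Synthetic facts: 100 subjects and 7 predicates, so sorting clusters
# identical ?s and ?s ?p stems next to each other.
lines = [
    f'<http://example.org/s{i % 100}> <http://example.org/p{i % 7}> "value {i}" .\n'
    for i in range(5_000)
]
random.seed(0)
shuffled = lines[:]
random.shuffle(shuffled)

shuffled_gz = len(gzip.compress("".join(shuffled).encode()))
sorted_gz = len(gzip.compress("".join(sorted(lines)).encode()))
print(shuffled_gz, sorted_gz)
```

With the file sorted, matching lines sit within gzip’s small window instead of 
being scattered across the whole file.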

    Bzip, with its ability to work like a Markov chain with the element of 
chance taken out, is usually more effective at compression than gzip, but I’ve 
noticed some exceptions. In the original :BaseKB products, all of the nodes 
looked like

<http://rdf.basekb.com/ns/m.112az>

I found my ?s ?p ?o sorted data compressed better with gzip than with bzip, 
and perhaps the structure of the identifiers had something to do with it.

A big advantage of bzip is that the block-based nature of the compression 
means that blocks can be compressed and decompressed in parallel (pbzip2 is a 
drop-in replacement for bzip2), so the possible top speed of decompressing 
bzip data is in principle unlimited, even though bzip is a more expensive 
algorithm. Hadoop, in version 1.1.0+, can even automatically decompress a 
bzip2 file and split the result into separate mappers. Generally system 
performance is better if you read data out of pre-split gzip, but it is just 
so easy to load a big bz2 into HDFS and point a lot of transistors at it.
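The block independence is easy to see with Python’s stdlib: streams compressed 
separately (in principle on separate cores, which is roughly what pbzip2 does) 
concatenate into something a stock decoder reads straight through:

```python
import bz2

# Split the input into chunks and compress each one independently, as a
# parallel compressor could do on separate cores.
chunks = [(f"fact {i}\n" * 1_000).encode() for i in range(4)]
streams = b"".join(bz2.compress(chunk) for chunk in chunks)

# A standard decoder reads the concatenated streams back as one file.
assert bz2.decompress(streams) == b"".join(chunks)
```

Gzip members concatenate the same way, but gzip is already so cheap to 
decompress that there’s little reason to parallelize it.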

I am very much against blank nodes for ‘wiki-ish’ data that is shared between 
systems. The fact that Freebase reifies “blank nodes” as CVTs means that we 
can talk about them on the outside, reason about them, and then name them in 
order to interact with them on the live Freebase system. By their nature, 
blank nodes defy the “anyone, anything, anywhere” concept because they can’t 
be referred to. In the case of OWL that’s a feature, not a bug, because you 
can really close the world: nobody can add anything to a Lisp-style list 
without introducing a new node. Outside of tiny T-Boxes (say, SUMO-sized), 
internal DSLs like SPIN, or expressing algebraic sorts of things (e.g. 
describing the mixed eigenstates of quarks in some hadron), the mainstream of 
linked data doesn’t use them.

Personally I’d like to see the data published in quad form, with the 
reification data expressed in the context field. As much as possible, the 
things in the (?s ?p ?o) fields should make sense as facts. Ideally you could 
reuse one ?c node for a lot of facts, such as when a number of them came in 
one transaction. You could query on the ?c fields (show me all estimates for 
the population of Berlin from 2000 to the present and who made them), or you 
could go through the file of facts and pick the ?c’s that provide the point of 
view that you need the system to have.
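Something like this (hypothetical IRIs and values), with one ?c node shared by 
every fact that arrived in the same transaction:

```
<http://example.org/Berlin> <http://example.org/population> "3450000" <http://example.org/tx/1001> .
<http://example.org/Berlin> <http://example.org/area_km2> "891.8" <http://example.org/tx/1001> .
<http://example.org/tx/1001> <http://example.org/assertedBy> <http://example.org/StatisticsOffice> .
<http://example.org/tx/1001> <http://example.org/assertedOn> "2012-06-01" .
```

The first two lines stand alone as facts; the ?c field carries the provenance 
without reifying each triple individually.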

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
