Hi Markus (cc'ing DBpedia discussion),
of course, you are right in what you say. Sometimes my opinions are too strong, but please understand that this is only because I feel I have wasted too much time patching bad syntax. The appropriate list for such a discussion is the DBpedia mailing list (now in CC), because it concerns best practices for publishing large RDF data dumps on the Web in a practical manner. We can drop the Wikidata list if the discussion is not of interest there. The RDF serialization formats are given as standards, so the question is which one to choose. DBpedia is one of the projects that have tried hard to make the Web of Data work.

Of course, I can understand your arguments, but there is a difference between HTML tag soup and the RDF compatibility layer and tool chains. I am confident that creating a robust Turtle parser presents quite a challenge:

@prefix : <http://example.org/ex#> . <s> <pl> """missing quote"" ; <p> :works , :doesn,t , :0neither , <chars_,;[]_are_allowed_in_full_URIs_by_the_way> ; <p> [ <p> <c ] , [<p> <d> ] ; <find> <me> .

definitely more work than:
while ((line = readLine()) != null) {
    try {
        parse(line);
    } catch (Exception e) {
        System.out.println("syntax error in: " + line);
    }
}

So as a best practice, I would definitely go for
**alphabetically sorted, one-triple-per-line, non-prefixed Turtle with full IRIs** (with a "not sure" about blank nodes).

Example:
<http://ko.dbpedia.org/resource/지미_카터> <http://dbpedia.org/ontology/country> <http://ko.dbpedia.org/resource/미국> .
<http://ko.dbpedia.org/resource/지미_카터> <http://xmlns.com/foaf/0.1/name> "James Earl Carter, Jr."@ko .


@Markus: actually, the question is important for DBpedia, because disk space on our download server is getting tight for DBpedia 3.9 and other upcoming data publishing projects. I'm sorry to use your thread for this, but I see an opportunity to establish a "best current practice" easily, and we might be able to save a lot of space by doing so.

Maybe we can skip the N-Triples .nt and .nq files?


435G    downloads.dbpedia.org
1.8G    1.0
2.5G    2.0
5.1G    3.0
7.6G    3.0rc
6.0G    3.1
6.4G    3.2
7.3G    3.3
21G    3.4
32G    3.5
35G    3.5.1
34G    3.6
44G    3.7
63G    3.7-i18n
169G    3.8
??? 3.9
22M    wikicompany
1.6G    wiktionary
...

All the best,
Sebastian

On 10.08.2013 14:35, Markus Krötzsch wrote:
Dear Sebastian,

On 10/08/13 12:18, Sebastian Hellmann wrote:
Hi Markus!
Thank you very much.

Regarding your last email:
Of course, I am aware of your argument in your last email that the
dump is not "official". Nevertheless, I expect you and others to
code (or supervise) similar RDF dumping projects in the future.

Here are two really important things to consider:

1. Always use a mature RDF framework for serializing:
...

Statements that involve "always" are easy to disagree with. An important part of software engineering is to achieve one's goals with an optimal investment of resources. If you work on larger and longer-term projects, you will start to appreciate that the theoretically "best" or "cleanest" solution is not always the one that leads to a successful project. On the contrary, such a viewpoint can even make it harder to work in a "messy" environment, full of tools and data that do not quite adhere to the high ideals that one would like everyone (on the Web!) to have. You can see a good example of this in the evolution of HTML.

Turtle is *really* easy to parse in a robust and fault-tolerant way. I am tempted to write a little script that sanitizes Turtle input in a streaming fashion by discarding garbage triples. Can't take more than a weekend to do that, don't you think? But I already have plans this weekend :-)
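
For illustration, a minimal sketch of such a sanitizer in Python, assuming the input already uses one triple per line and contains no @prefix directives (prefixed or nested Turtle would need a real incremental parser); rdflib, the script name, and the invocation are my own assumptions, not an existing tool:

# sanitize_lines.py: keep only lines that parse as a single Turtle triple
import sys
from rdflib import Graph

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    try:
        Graph().parse(data=line, format="turtle")  # validate this one triple
        print(line)
    except Exception:
        sys.stderr.write("skipping garbage triple: " + line + "\n")

(One could then run it over a dump with something like "bzcat dump.ttl.bz2 | python3 sanitize_lines.py > clean.ttl".)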


2. Use N-Triples or one-triple-per-line Turtle
(Turtle supports IRIs and Unicode; compare):
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | bzcat | head

one-triple-per-line
a) makes it easier to find errors and
b) aids further processing, e.g. calculating the outdegree of subjects:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' |
awk '{count[$1]++} END {for (j in count) print "<" j ">\t" count[j]}'

Furthermore:
- parsers can handle one-triple-per-line more robustly, by simply skipping bad lines
- the compressed size is the same
- alphabetical ordering of the data works well (e.g. for GitHub diffs)
- you can easily split the files into several smaller files (see the sketch below)
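
To illustrate the last point, a splitting sketch (Python 3; the file names and chunk size are placeholders of my own):

# Split a one-triple-per-line file into chunks of 1000000 lines each.
CHUNK = 1000000
out, part = None, 0
with open("dump.nt", encoding="utf-8") as src:
    for i, line in enumerate(src):
        if i % CHUNK == 0:
            if out:
                out.close()
            part += 1
            out = open("dump-part-%03d.nt" % part, "w", encoding="utf-8")
        out.write(line)
if out:
    out.close()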

See above. Why not write a little script that streams a Turtle file and creates one-triple-per-line output? This could be done with very little memory overhead in a streaming fashion. Both nested and line-by-line Turtle have their advantages and disadvantages, but one can trivially be converted into the other, whereas the reverse conversion is not as easy.
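
For illustration, a minimal (non-streaming) sketch of such a conversion with rdflib; the file names are placeholders, and a truly streaming variant would need an incremental parser or a tool like rapper:

from rdflib import Graph

g = Graph()
g.parse("wikidata-statements.ttl", format="turtle")             # nested, prefixed Turtle (placeholder name)
g.serialize(destination="wikidata-statements.nt", format="nt")  # one triple per line, full IRIs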

Of course we will continue to improve our Turtle quality, but there will always be someone who would prefer a slightly different format. One will always have to draw a line somewhere.



Blank nodes have some bad properties:
- some databases react weirdly to them; they sometimes fill up indexes and make the DB slow (this depends on the implementation, of course, and is just my experience)
- they make splitting one-triple-per-line files more difficult
- they are difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or http://web.ing.puc.cl/~marenas/publications/iswc11.pdf

Does this relate to Wikidata or are we getting into general RDF design discussions here (wrong list)? Wikidata uses blank nodes only for serialising OWL axioms, and there is no alternative in this case.



Turtle prefixes:
Why do you think they are a "good thing"? They are sometimes disputed as a premature feature. They do make data more readable, but nobody is going to read 4.4 GB of Turtle.

If you want to fight against existing W3C standards, this is really not the right list. I have not made Turtle, and I won't defend its design here. But since you asked: I think readability is a good thing.

By the way, you can always convert it to Turtle easily:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head -100 | rapper -i turtle -o turtle -I - - file

If conversion is so easy, it does not seem worthwhile to have much of a discussion about this at all.

Cheers,

Markus



On 10.08.2013 12:44, Markus Krötzsch wrote:
Good morning. I just found a bug that was caused by a bug in the
Wikidata dumps (a value that should be a URI was not). This led to a
few dozen lines with illegal qnames of the form "w: ". The updated
script fixes this.

Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:
Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:
Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.

You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.


I saw that you did not use a mature framework for serializing the
Turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people
(undergraduate
and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they did not have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework (a short sketch follows after this list)
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line-by-line format, e.g. Turtle without prefixes and just
one triple per line (it's like N-Triples, but with Unicode)
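
For illustration, a minimal rdflib sketch of points 1 and 3; the triple and namespace are just an example of my own, and this is not the actual wda code:

from rdflib import Graph, Literal, Namespace, URIRef

WD = Namespace("http://www.wikidata.org/entity/")               # example namespace
RDFS_LABEL = URIRef("http://www.w3.org/2000/01/rdf-schema#label")

g = Graph()
g.add((WD.Q42, RDFS_LABEL, Literal("Douglas Adams", lang="en")))

# N-Triples: one triple per line, full IRIs, no prefixes (rdflib 6+ returns a str here)
print(g.serialize(format="nt"))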

Yes, URI encoding could be difficult if we were doing it manually. Note,
however, that we are already using a standard library for URI encoding
in all non-trivial cases, so this does not seem to be a very likely
cause of the problem (though some non-zero probability remains). In
general, it is not unlikely that there are bugs in the RDF somewhere;
please consider this export as an early prototype that is meant for
experimentation purposes. If you want an official RDF dump, you will
have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the
dump useful (I successfully imported some 109 million triples into an
RDF store using a custom script), but I know that it can require some
tweaking.
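
(Purely as an aside and for illustration: standard-library URI encoding of the kind mentioned above looks roughly like this in Python 3; the file name and URL pattern here are made up, this is not the actual wda code.)

from urllib.parse import quote

name = "Jimmy Carter <test>.jpg"   # made-up example file name
url = "http://commons.wikimedia.org/wiki/File:" + quote(name.replace(" ", "_"))
print(url)   # '<' and '>' come out percent-encoded, so they cannot break the Turtle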


We are currently having a problem because we tried to convert the dump
to N-Triples (which would also be handled by a framework) with rapper.
We assume that the error is an extra "<" somewhere (not confirmed), and
we are still searching for it, since the dump is so big...

Ok, looking forward to hearing about the results of your search. A good tip
for checking such things is to use grep. I did a quick grep on my
current local statements export to count the numbers of < and > (this
takes less than a minute on my laptop, including on-the-fly
decompression). Both numbers were equal, making it unlikely that there
is any unmatched < in the current dumps. Then I used grep to check that < and > only occur in the statements files in lines with "commons" URLs.
These are created using urllib, so there should never be any < or > in
them.
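
For illustration, the kind of check described here could be scripted roughly like this (Python 3; the file name is a placeholder):

import bz2

# Count '<' and '>' in a bz2-compressed dump; equal counts make an unmatched '<' unlikely.
lt = gt = 0
with bz2.open("wikidata-statements.ttl.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        lt += line.count("<")
        gt += line.count(">")
print(lt, gt, "balanced" if lt == gt else "mismatch")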

so we cannot provide a detailed bug report. If we had one triple per
line, this would also be easier, and there are advantages for stream
reading. bzip2 compression is very good as well, so there is no need for
prefix optimization.

Not sure what you mean here. Turtle prefixes in general seem to be a
Good Thing, not just for reducing the file size. The code has no easy
way to get rid of prefixes, but if you want a line-by-line export you
could subclass my exporter and override the methods for incremental
triple writing so that they remember the last subject (or property) and
create full triples instead. This would give you a line-by-line export
in (almost) no time (some uses of [...] blocks in object positions would
remain, but maybe you could live with that).

Best wishes,

Markus


All the best,
Sebastian

On 03.08.2013 23:22, Markus Krötzsch wrote:
Update: the first bugs in the export have already been discovered --
and fixed in the script on github. The files I uploaded will be
updated on Monday when I have a better upload again (the links file
should be fine, the statements file requires a rather tolerant Turtle
string literal parser, and the labels file has a malformed line that
will hardly work anywhere).

Markus

On 03/08/13 14:48, Markus Krötzsch wrote:
Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script
downloads
recent Wikidata database dumps and processes them to create
RDF/Turtle
files. Various options are available to customize the output
(e.g., to
export statements but not references, or to export only texts in
English
and Wolof). The file creation takes a few (about three) hours on my
machine depending on what exactly is exported.

For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase
ontology
that is used in the exports.

The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.).
Details
might still change, of course. We might provide regular dumps at
another
location once the format is stable.

As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.

Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media
data is
correctly exported as URLs (not as strings).

(2) One-pass processing. Dumps are processed only once, even though
this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing
information.
This is only relevant when exporting statements.

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might
not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)

(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.

(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include
statement
ranks and timezones for time datavalues. I also currently exclude
labels
and descriptions for simple English, formal German, and informal
Dutch,
since these would pollute the label space for English, German, and
Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).

Feedback is welcome.

Cheers,

Markus

[1] https://github.com/mkroetzsch/wda
     Run "python wda-export.data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF







--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Extended Deadline: *July 18th*)
* LSWT 23/24 Sept, 2013 in Leipzig (http://aksw.org/lswt)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
