Re: [Wikidata-l] Wikidata RDF export available

2013-08-13 Thread Kingsley Idehen

On 8/12/13 12:56 PM, Nicolas Torzec wrote:

With respect to the RDF export I'd advocate for:
1) an RDF format with one fact per line.
2) the use of a mature/proven RDF generation framework.


Yes, keep it simple, use Turtle.

The additional benefit of Turtle is that it addresses a wide data 
consumer profile, i.e., one that extends from the casual end user all the 
way up to a parser developer.


When producing Turtle, if possible, stay away from prefixes; also look 
to using relative URIs, which will eliminate complexity and confusion 
that can arise with Linked Data deployment.


Simple rules that have always helped me:

1. denote entities not of type Web Resource or Document using hash-based 
HTTP URIs
2. denote source documents (the documents composed of the data being 
published) using relative URIs via <>

3. stay away from prefixes (they confuse casual end-users).
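
For illustration, output following these rules might look like the snippet 
below (printed here via a tiny Python helper; the entity and property URIs 
are made up for the example and are not Wikidata's actual vocabulary):

# Illustrative only: hypothetical entity/property URIs.
doc_triples = [
    # Rule 1: hash-based HTTP URI for the entity being described.
    # Rule 3: no prefixes; every term is a full URI.
    "<http://example.org/data/Q42#this> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://schema.org/Person> .",
    # Rule 2: the source document denotes itself via the relative URI <>.
    "<> <http://xmlns.com/foaf/0.1/primaryTopic> "
    "<http://example.org/data/Q42#this> .",
]
print("\n".join(doc_triples))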

BTW -- I suspect some might be wondering: isn't this N-Triples? Answer: 
No, because of the use of relative HTTP URIs to denote documents, which 
isn't supported by N-Triples.


A Turtle-based structured data dump of the Wikidata RDF model would 
be a mighty valuable contribution to the Linked Open Data Cloud.


--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen



Re: [Wikidata-l] Wikidata RDF export available

2013-08-12 Thread Markus Krötzsch

On 12/08/13 17:56, Nicolas Torzec wrote:

With respect to the RDF export I'd advocate for:
1) an RDF format with one fact per line.
2) the use of a mature/proven RDF generation framework.

Optimizing too early based on a limited and/or biased view of the
potential use cases may not be a good idea in the long run.
I'd rather keep it simple and standard at the data publishing level, and
let consumers access data easily and optimize processing to their needs.


RDF has several official, standardised syntaxes, and one of them is 
Turtle. Using it is not a form of optimisation, just a choice of syntax. 
Every tool I have ever used for serious RDF work (triple stores, 
libraries, even OWL tools) supports any of the standard RDF syntaxes 
*just as well*. I do see that some formats have certain advantages 
and others have different ones (I agree with most arguments that have been put 
forward). But would it not be better to first take a look at the actual 
content rather than debating the syntactic formatting now? As I said, 
this is not the final syntax anyway, which will be created with 
different code in a different programming language.




Also, I should not have to run a preprocessing step for filtering out the
pieces of data that do not follow the standard…


To the best of our knowledge, there are no such pieces in the current 
dump. We should try to keep this conversation somewhat related to the 
actual Wikidata dump that is created by the current version of the 
Python script on GitHub (I will also upload a dump again tomorrow; 
currently, you can only get the dump by running the script yourself). I 
know I suggested that one could parse Turtle in a robust way (which I 
still think one can), but I am not suggesting for a moment that this 
should be necessary for using Wikidata dumps in the future. I am 
committed to fixing any error as it is found, but so far I have not received 
much input in that direction.




Note that I also understand the need for a format that groups all facts
about a subject into one record, and serializes them one record per line.
It sometimes makes life easier for bulk processing of large datasets. But
that's a different discussion.



As I said: advantages and disadvantages. This is why we will probably 
have all desired formats at some time. But someone needs to start somewhere.


Markus




--
Nicolas Torzec.












On 8/12/13 1:49 AM, "Markus Krötzsch" 
wrote:


On 11/08/13 22:29, Tom Morris wrote:

On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch
mailto:mar...@semantic-mediawiki.org>>
wrote:

 Anyway, if you restrict yourself to tools that are installed by
 default on your system, then it will be difficult to do many
 interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump
 is really meant specifically for tools that take RDF inputs. It is
 not very straightforward to encode all of Wikidata in triples, and
 it leads to some inconvenient constructions (especially a lot of
 reification). If you don't actually want to use an RDF tool and you
 are just interested in the data, then there would be easier ways of
 getting it.


A single fact per line seems like a pretty convenient format to me.
   What format do you recommend that's easier to process?


I'd suggest some custom format that at least keeps single data values in
one line. For example, in RDF, you have to do two joins to find all
items that have a property with a date in the year 2010. Even with a
line-by-line format, you will not be able to grep this. So I think a
less normalised representation would be nicer for direct text-based
processing. For text-based processing, I would probably prefer a format
where one statement is encoded on one line. But it really depends on
what you want to do. Maybe you could also remove some data to obtain
something that is easier to process.

Markus




Re: [Wikidata-l] Wikidata RDF export available

2013-08-12 Thread Nicolas Torzec
With respect to the RDF export I'd advocate for:
1) an RDF format with one fact per line.
2) the use of a mature/proven RDF generation framework.

Optimizing too early based on a limited and/or biased view of the
potential use cases may not be a good idea in the long run.
I'd rather keep it simple and standard at the data publishing level, and
let consumers access data easily and optimize processing to their needs.

Also, I should not have to run a preprocessing step for filtering out the
pieces of data that do not follow the standard…



Note that I also understand the need for a format that groups all facts
about a subject into one record, and serializes them one record per line.
It sometimes makes life easier for bulk processing of large datasets. But
that's a different discussion.





--
Nicolas Torzec.












On 8/12/13 1:49 AM, "Markus Krötzsch" 
wrote:

>On 11/08/13 22:29, Tom Morris wrote:
>> On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch
>> mailto:mar...@semantic-mediawiki.org>>
>> wrote:
>>
>> Anyway, if you restrict yourself to tools that are installed by
>> default on your system, then it will be difficult to do many
>> interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump
>> is really meant specifically for tools that take RDF inputs. It is
>> not very straightforward to encode all of Wikidata in triples, and
>> it leads to some inconvenient constructions (especially a lot of
>> reification). If you don't actually want to use an RDF tool and you
>> are just interested in the data, then there would be easier ways of
>> getting it.
>>
>>
>> A single fact per line seems like a pretty convenient format to me.
>>   What format do you recommend that's easier to process?
>
>I'd suggest some custom format that at least keeps single data values in
>one line. For example, in RDF, you have to do two joins to find all
>items that have a property with a date in the year 2010. Even with a
>line-by-line format, you will not be able to grep this. So I think a
>less normalised representation would be nicer for direct text-based
>processing. For text-based processing, I would probably prefer a format
>where one statement is encoded on one line. But it really depends on
>what you want to do. Maybe you could also remove some data to obtain
>something that is easier to process.
>
>Markus
>
>


Re: [Wikidata-l] Wikidata RDF export available

2013-08-12 Thread Markus Krötzsch

On 11/08/13 22:29, Tom Morris wrote:

On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch
mailto:mar...@semantic-mediawiki.org>>
wrote:

Anyway, if you restrict yourself to tools that are installed by
default on your system, then it will be difficult to do many
interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump
is really meant specifically for tools that take RDF inputs. It is
not very straightforward to encode all of Wikidata in triples, and
it leads to some inconvenient constructions (especially a lot of
reification). If you don't actually want to use an RDF tool and you
are just interested in the data, then there would be easier ways of
getting it.


A single fact per line seems like a pretty convenient format to me.
  What format do you recommend that's easier to process?


I'd suggest some custom format that at least keeps single data values in 
one line. For example, in RDF, you have to do two joins to find all 
items that have a property with a date in the year 2010. Even with a 
line-by-line format, you will not be able to grep this. So I think a 
less normalised representation would be nicer for direct text-based 
processing. For text-based processing, I would probably prefer a format 
where one statement is encoded on one line. But it really depends on 
what you want to do. Maybe you could also remove some data to obtain 
something that is easier to process.
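
To make the "two joins" point concrete, here is a small sketch with rdflib, 
using a made-up reification vocabulary (ex:) rather than the actual terms of 
the RDF export: the item links to a statement node, which links to a value 
node carrying the date, so a year filter only applies after two joins.

# Sketch only: ex: is a stand-in vocabulary, not the wda export's terms.
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Q1 ex:statement ex:Q1S1 .
ex:Q1S1 ex:property ex:P569 ;
        ex:value ex:V1 .
ex:V1 ex:time "2010-05-01T00:00:00Z"^^xsd:dateTime .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

query = """
PREFIX ex: <http://example.org/>
SELECT ?item WHERE {
  ?item ex:statement ?stmt .   # join 1: item -> statement node
  ?stmt ex:value ?val .        # join 2: statement node -> value node
  ?val ex:time ?t .
  FILTER (YEAR(?t) = 2010)
}
"""
for row in g.query(query):
    print(row.item)   # -> http://example.org/Q1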


Markus




Re: [Wikidata-l] Wikidata RDF export available

2013-08-11 Thread Tom Morris
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Anyway, if you restrict yourself to tools that are installed by default on
> your system, then it will be difficult to do many interesting things with a
> 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for
> tools that take RDF inputs. It is not very straightforward to encode all of
> Wikidata in triples, and it leads to some inconvenient constructions
> (especially a lot of reification). If you don't actually want to use an RDF
> tool and you are just interested in the data, then there would be easier
> ways of getting it.
>

A single fact per line seems like a pretty convenient format to me.  What
format do you recommend that's easier to process?

Tom


Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Markus Krötzsch

Hi Tom,

On 10/08/13 15:55, Tom Morris wrote:

Given your "educating" people about software engineering principles,
this may fall on deaf ears, but I too have a strong preference for the
format with an independent line per triple.


No worries. The eventual RDF export of Wikidata will most certainly have 
this (and any other standard format one could want). If you need 
NTriples export earlier, but do not want to use a second tool for this, 
then you could modify the triple writing methods in the python script as 
I suggested a few emails ago.




On Sat, Aug 10, 2013 at 8:35 AM, Markus Krötzsch
mailto:markus.kroetz...@cs.ox.ac.uk>> wrote:

On 10/08/13 12:18, Sebastian Hellmann wrote:


By the way, you can always convert it to turtle easily:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head -100 | rapper -i turtle -o turtle -I - - file


If conversion is so easy, it does not seem worthwhile to have much
of a discussion about this at all.


The point of the discussion is to advocate for a format that is most useful
to the maximum number of people out of the box. Rapper isn't installed
by default on systems.  A file format with independent lines can be
processed using grep and other simple command line tools without having
to find and install additional software.


I think the rapper command you refer to was only for expanding prefixes, 
not for making line-by-line syntax. Prefixes should not create any 
grepping inconveniences.


Anyway, if you restrict yourself to tools that are installed by default 
on your system, then it will be difficult to do many interesting things 
with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant 
specifically for tools that take RDF inputs. It is not very 
straightforward to encode all of Wikidata in triples, and it leads to 
some inconvenient constructions (especially a lot of reification). If 
you don't actually want to use an RDF tool and you are just interested 
in the data, then there would be easier ways of getting it.


Out of curiosity, what kind of use do you have in mind for the RDF (or 
for the data in general)?


Cheers,

Markus




Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Tom Morris
Given your "educating" people about software engineering principles, this
may fall on deaf ears, but I too have a strong preference for the format
with an independent line per triple.

On Sat, Aug 10, 2013 at 8:35 AM, Markus Krötzsch <
markus.kroetz...@cs.ox.ac.uk> wrote:
>
> On 10/08/13 12:18, Sebastian Hellmann wrote:
>
>>
>>  By the way, you can always convert it to turtle easily:
>> curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
>> bzcat | head -100 | rapper -i turtle -o turtle -I - - file
>>
>
> If conversion is so easy, it does not seem worthwhile to have much of a
> discussion about this at all.
>

The point of the discussion is to advocate for a format that is most useful to
the maximum number of people out of the box. Rapper isn't installed by
default on systems.  A file format with independent lines can be processed
using grep and other simple command line tools without having to find and
install additional software.

Tom


Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Markus Krötzsch

Dear Sebastian,

On 10/08/13 12:18, Sebastian Hellmann wrote:

Hi Markus!
Thank you very much.

Regarding your last email:
Of course, I am aware of your arguments in your last email, that the
dump is not "official". Nevertheless, I am expecting you and others to
code (or supervise) similar RDF dumping projects in the future.

Here are two really important things to consider:

1. Always use a mature RDF framework for serializing:

...

Statements that involve "always" are easy to disagree with. An important 
part of software engineering is to achieve one's goals with an optimal 
investment of resources. If you work on larger and more long-term 
projects, you will start to appreciate that the theoretically "best" or 
"cleanest" solution is not always the one that leads to a successful 
project. On the contrary, such a viewpoint can even make it harder to 
work in "messy" surroundings, full of tools and data that do not quite 
adhere to the high ideals that one would like everyone (on the Web!) to 
have. You can see a good example of this in the evolution of HTML.


Turtle is *really* easy to parse in a robust and fault-tolerant way. I 
am tempted to write a little script that sanitizes Turtle input in a 
streaming fashion by discarding garbage triples. Can't take more than a 
weekend to do that, don't you think? But I already have plans this 
weekend :-)




2. Use NTriples or one-triple-per-line Turtle:
(Turtle supports IRIs and unicode, compare)
curl
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head
curl
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 |
bzcat | head

one-triple-per-line lets you
a) find errors more easily and
b) aid further processing, e.g. calculating the outdegree of subjects:
curl
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//' |
awk '{count[$1]++}END{for(j in count) print "<" j ">" "\t"count [j]}'

Furthermore:
- Parsers can treat one-triple-per-line more robustly, by simply skipping bad lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the file into several smaller files easily


See above. Why not write a little script that streams a Turtle file and 
creates one-triple-per-line output? This could be done with very little 
memory overhead in a streaming fashion. Both nested and line-by-line 
Turtle have their advantages and disadvantages, but one can trivially be 
converted into the other whereas the other cannot be converted back easily.
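
(For what it is worth, the non-streaming variant of that conversion is only a 
few lines with an off-the-shelf library; a sketch with rdflib, where the file 
names are just examples and the whole graph is loaded into memory rather than 
streamed:)

# Non-streaming sketch: parse nested Turtle, write N-Triples (one triple per line).
from rdflib import Graph

g = Graph()
g.parse("wikidata-statements.ttl", format="turtle")
g.serialize(destination="wikidata-statements.nt", format="nt")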


Of course we will continue to improve our Turtle quality, but there will 
always be someone who would prefer a slightly different format. One will 
always have to draw a line somewhere.





Blank nodes have some bad properties:
- some databases react weird to them and they sometimes fill up indexes
and make the DB slow (depends on the implementations of course, this is
just my experience )
- make splitting one-triple-per-line more difficult
- difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or
http://web.ing.puc.cl/~marenas/publications/iswc11.pdf


Does this relate to Wikidata or are we getting into general RDF design 
discussions here (wrong list)? Wikidata uses blank nodes only for 
serialising OWL axioms, and there is no alternative in this case.





Turtle prefixes:
Why do you think they are a "good thing"? They are sometimes disputed 
as a premature feature. They do make data more readable, but nobody is
going to read 4.4 GB of Turtle.


If you want to fight against existing W3C standards, this is really not 
the right list. I have not made Turtle, and I won't defend its design 
here. But since you asked: I think readability is a good thing.



By the way, you can always convert it to turtle easily:
curl
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
bzcat | head -100  | rapper -i turtle -o turtle -I - - file


If conversion is so easy, it does not seem worthwhile to have much of a 
discussion about this at all.


Cheers,

Markus




Am 10.08.2013 12:44, schrieb Markus Krötzsch:

Good morning. I just found a bug that was caused by a bug in the
Wikidata dumps (a value that should be a URI was not). This led to a
few dozen lines with illegal qnames of the form "w: ". The updated
script fixes this.

Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.

Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Sebastian Hellmann

Hi Markus!
Thank you very much.

Regarding your last email:
Of course, I am aware of your arguments in your last email, that the 
dump is not "official". Nevertheless, I am expecting you and others to 
code (or supervise) similar RDF dumping projects in the future.


Here are two really important things to consider:

1. Always use a mature RDF framework for serializing:
For years, even DBpedia published RDF that had some errors in it; 
this was really frustrating for maintainers (handling bug reports) and 
clients (trying to quick-fix it).
Other small projects (in fact exactly the same situation as yours, Markus: one 
person publishing some useful software) went the same way: lots of small 
syntax bugs, many bug reports, and a lot of additional work. Some of them 
were abandoned because the developer didn't have time anymore.


2. Use NTriples or one-triple-per-line Turtle:
(Turtle supports IRIs and unicode, compare)
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | 
bzcat | head


one-triple-per-line lets you
a) find errors more easily and
b) aid further processing, e.g. calculating the outdegree of subjects:
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//' | 
awk '{count[$1]++}END{for(j in count) print "<" j ">" "\t"count [j]}'
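
The same count as a Python sketch, assuming the dump has been downloaded 
locally (the file name is only an example):

import bz2
import collections

counts = collections.Counter()
with bz2.open("mappingbased_properties_ko.nt.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue
        subject = line.split(">", 1)[0] + ">"   # everything up to the first '>'
        counts[subject] += 1

for subject, n in counts.most_common(10):
    print(subject, n, sep="\t")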


Furthermore:
- Parsers can treat one-triple-per-line more robustly, by simply skipping bad lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the file into several smaller files easily


Blank nodes have some bad properties:
- some databases react weird to them and they sometimes fill up indexes 
and make the DB slow (depends on the implementations of course, this is 
just my experience )

- make splitting one-triple-per-line more difficult
- difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or 
http://web.ing.puc.cl/~marenas/publications/iswc11.pdf



Turtle prefixes:
Why do you think they are a "good thing"? They are sometimes disputed 
as a premature feature. They do make data more readable, but nobody is 
going to read 4.4 GB of Turtle.

By the way, you can always convert it to turtle easily:
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head -100  | rapper -i turtle -o turtle -I - - file


All the best,
Sebastian



Am 10.08.2013 12:44, schrieb Markus Krötzsch:
Good morning. I just found a bug that was caused by a bug in the 
Wikidata dumps (a value that should be a URI was not). This led to a 
few dozen lines with illegal qnames of the form "w: ". The updated 
script fixes this.


Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.



I saw, that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people 
(undergraduate

and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of time.
They wanted to do it quickly, but they didn't have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)


Yes, URI encoding could be difficult if we were doing it manually. Note,
however, that we are already using a standard library for URI encoding
in all non-trivial cases, so this does not seem to be a very likely
cause of the problem (though some non-zero probability remains). In
general, it is not unlikely that there are bugs in the RDF somewhere;
please consider this export as an early prototype that is meant for
experimentation purposes. If you want an official RDF dump, you will
have to wait for the Wikidata project team to get around to doing it (this
will surely be based on an RDF library). Personally, I already found the
dump useful (I successfully imported some 109 million triples into an RDF
store with a custom script), but I know that it can require some tweaking.

Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Markus Krötzsch
Good morning. I just found a bug that was caused by a bug in the 
Wikidata dumps (a value that should be a URI was not). This led to a few 
dozen lines with illegal qnames of the form "w: ". The updated script 
fixes this.


Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.



I saw, that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of time.
They wanted to do it quickly, but they didn't have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)


Yes, URI encoding could be difficult if we were doing it manually. Note,
however, that we are already using a standard library for URI encoding
in all non-trivial cases, so this does not seem to be a very likely
cause of the problem (though some non-zero probability remains). In
general, it is not unlikely that there are bugs in the RDF somewhere;
please consider this export as an early prototype that is meant for
experimentation purposes. If you want an official RDF dump, you will
have to wait for the Wikidata project team to get around to doing it (this
will surely be based on an RDF library). Personally, I already found the
dump useful (I successfully imported some 109 million triples into an RDF
store with a custom script), but I know that it can require some
tweaking.



We are having a problem currently, because we tried to convert the dump
to NTriples (which would be handled by a framework as well) with rapper.
We assume that the error is an extra "<" somewhere (not confirmed) and
we are still searching for it since the dump is so big


Ok, looking forward to hearing about the results of your search. A good tip
for checking such things is to use grep. I did a quick grep on my
current local statements export to count the numbers of < and > (this
takes less than a minute on my laptop, including on-the-fly
decompression). Both numbers were equal, making it unlikely that there
is any unmatched < in the current dumps. Then I used grep to check that
< and > only occur in the statements files in lines with "commons" URLs.
These are created using urllib, so there should never be any < or > in
them.


so we can not provide a detailed bug report. If we had one triple per
line, this would also be easier, plus there are advantages for stream
reading. bzip2 compression is very good as well, no need for prefix
optimization.


Not sure what you mean here. Turtle prefixes in general seem to be a
Good Thing, not just for reducing the file size. The code has no easy
way to get rid of prefixes, but if you want a line-by-line export you
could subclass my exporter and override the methods for incremental
triple writing so that they remember the last subject (or property) and
create full triples instead. This would give you a line-by-line export
in (almost) no time (some uses of [...] blocks in object positions would
remain, but maybe you could live with that).

Best wishes,

Markus



All the best,
Sebastian

Am 03.08.2013 23:22, schrieb Markus Krötzsch:

Update: the first bugs in the export have already been discovered --
and fixed in the script on github. The files I uploaded will be
updated on Monday when I have a better upload again (the links file
should be fine, the statements file requires a rather tolerant Turtle
string literal parser, and the labels file has a malformed line that
will hardly work anywhere).

Markus

On 03/08/13 14:48, Markus Krötzsch wrote:

Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in
English and Wolof).

Re: [Wikidata-l] Wikidata RDF export available

2013-08-09 Thread Markus Krötzsch

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under 
heavy development, so changes are quite frequent. Please expect things 
to be broken in some cases (this is just a little community project, not 
part of the official Wikidata development).


I have just uploaded a new statements export (20130808) to 
http://semanticweb.org/RDF/Wikidata/ which you might want to try.




I saw, that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of time.
They wanted to do it quickly, but they didn't have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)


Yes, URI encoding could be difficult if we were doing it manually. Note, 
however, that we are already using a standard library for URI encoding 
in all non-trivial cases, so this does not seem to be a very likely 
cause of the problem (though some non-zero probability remains). In 
general, it is not unlikely that there are bugs in the RDF somewhere; 
please consider this export as an early prototype that is meant for 
experimentation purposes. If you want an official RDF dump, you will 
have to wait for the Wikidata project team to get around to doing it (this 
will surely be based on an RDF library). Personally, I already found the 
dump useful (I successfully imported some 109 million triples into an RDF 
store with a custom script), but I know that it can require some 
tweaking.




We are having a problem currently, because we tried to convert the dump
to NTriples (which would be handled by a framework as well) with rapper.
We assume that the error is an extra "<" somewhere (not confirmed) and
we are still searching for it since the dump is so big


Ok, looking forward to hearing about the results of your search. A good tip 
for checking such things is to use grep. I did a quick grep on my 
current local statements export to count the numbers of < and > (this 
takes less than a minute on my laptop, including on-the-fly 
decompression). Both numbers were equal, making it unlikely that there 
is any unmatched < in the current dumps. Then I used grep to check that 
< and > only occur in the statements files in lines with "commons" URLs. 
These are created using urllib, so there should never be any < or > in them.
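
For anyone who wants to redo that kind of check, a quick sketch in Python 
(the file name is an example; plain grep works just as well):

import bz2

lt = gt = 0
with bz2.open("wikidata-statements.ttl.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        lt += line.count("<")
        gt += line.count(">")

print(lt, gt, "balanced" if lt == gt else "unbalanced")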



so we can not provide a detailed bug report. If we had one triple per
line, this would also be easier, plus there are advantages for stream
reading. bzip2 compression is very good as well, no need for prefix
optimization.


Not sure what you mean here. Turtle prefixes in general seem to be a 
Good Thing, not just for reducing the file size. The code has no easy 
way to get rid of prefixes, but if you want a line-by-line export you 
could subclass my exporter and override the methods for incremental 
triple writing so that they remember the last subject (or property) and 
create full triples instead. This would give you a line-by-line export 
in (almost) no time (some uses of [...] blocks in object positions would 
remain, but maybe you could live with that).
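
Sketched as a standalone class (with hypothetical method names; the real 
exporter's interface may look different), the idea is simply:

import sys

class LineByLineTripleWriter:
    """Illustration only, not the wda API: remember the current subject
    and emit one full triple per line instead of Turtle ';' blocks."""

    def __init__(self, out=sys.stdout):
        self.out = out
        self.subject = None

    def start_subject(self, subject):
        # Remember the subject instead of writing it once and abbreviating.
        self.subject = subject

    def write_property_value(self, prop, value):
        self.out.write("{} {} {} .\n".format(self.subject, prop, value))

w = LineByLineTripleWriter()
w.start_subject("<http://www.wikidata.org/entity/Q42>")
w.write_property_value("<http://www.w3.org/2000/01/rdf-schema#label>", '"Douglas Adams"@en')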


Best wishes,

Markus



All the best,
Sebastian

Am 03.08.2013 23:22, schrieb Markus Krötzsch:

Update: the first bugs in the export have already been discovered --
and fixed in the script on github. The files I uploaded will be
updated on Monday when I have a better upload again (the links file
should be fine, the statements file requires a rather tolerant Turtle
string literal parser, and the labels file has a malformed line that
will hardly work anywhere).

Markus

On 03/08/13 14:48, Markus Krötzsch wrote:

Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). The file creation takes a few (about three) hours on my
machine depending on what exactly is exported.

For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements only.

Re: [Wikidata-l] Wikidata RDF export available

2013-08-09 Thread Paul A. Houle
   Over time people have gotten the message that you shouldn't write XML 
like


   System.out.println("<sometag>"+someString+"</sometag>")

   because it is something that usually ends in tears.

   Although (most) RDF toolkits are like XML toolkits in that they choke on 
invalid data,  people who write RDF seem to have little concern about whether 
or not it is valid.  This cultural problem is one of the reasons why RDF has 
seemed to catch on so slowly.  If you told somebody their XML is invalid, 
they'll feel like they have to do something about it,  but people don't seem 
to take any action when they hear that the 20 GB file they published is trash.


   As a general practice you should use real RDF tools to write RDF files. 
This adds some overhead,  but it's generally not hard and it gives you a 
pretty good chance you'll get valid output. ;-)
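
In Python, for instance, that pattern is a few lines with a mature library 
such as rdflib, which takes care of escaping and serialization (the URIs and 
the literal below are made up):

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Q42, EX.label, Literal('A "tricky" value with <angle brackets>', lang="en")))
print(g.serialize(format="nt"))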


   Lately I've been working on this system

https://github.com/paulhoule/infovore/wiki

   which is intended to deal with exactly this situation on a large scale. 
The "Parallel Super Eyeball 3" (3 means triple,  PSE 4 is a hypothetical 
tool that does the same for quads) tool physically separates valid and 
invalid triples so you can use the valid triples while being aware of what 
invalid data tried to sneak in.


   Early next week I'm planning on rolling out ":BaseKB Now", which will be 
filtered Freebase data,  processed automatically on a weekly basis.  I've 
got a project in the pipeline that is going to require Wikipedia categories 
(I had better get them fast before they go away) and another large 4D 
metamemomic data set for which Wikidata Phase I will be a Rosetta Stone, so 
support for those data sets is on my critical path.


-Original Message- 
From: Sebastian Hellmann

Sent: Friday, August 9, 2013 10:44 AM
To: Discussion list for the Wikidata project.
Cc: Dimitris Kontokostas ; Jona Christopher Sahnwaldt
Subject: Re: [Wikidata-l] Wikidata RDF export available

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.

I saw, that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.




Re: [Wikidata-l] Wikidata RDF export available

2013-08-09 Thread Sebastian Hellmann

Hi Markus,
we just had a look at your python code and created a dump. We are still 
getting a syntax error for the turtle dump.


I saw, that you did not use a mature framework for serializing the 
turtle. Let me explain the problem:


Over the last 4 years, I have seen about two dozen people (undergraduate 
and PhD students, as well as Post-Docs) implement "simple" serializers 
for RDF.


They all failed.

This was normally not due to a lack of skill, but due to a lack of time. 
They wanted to do it quickly, but they didn't have the time 
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special 
characters in URIs. I would strongly advise you to:


1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line by line format, e.g. use turtle without prefixes and just 
one triple per line (It's like NTriples, but with Unicode)
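
(Points 1 and 2 together, as a sketch: parse the generated file back with a 
library and let it report the first syntax error; the file name is only an 
example.)

from rdflib import Graph

try:
    g = Graph()
    g.parse("wikidata-statements.ttl", format="turtle")
    print("OK:", len(g), "triples")
except Exception as err:   # the parser error includes position information
    print("Syntax problem:", err)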


We are having a problem currently, because we tried to convert the dump 
to NTriples (which would be handled by a framework as well) with rapper.
We assume that the error is an extra "<" somewhere (not confirmed) and 
we are still searching for it; since the dump is so big, 
we cannot provide a detailed bug report yet. If we had one triple per 
line, this would also be easier, plus there are advantages for stream 
reading. bzip2 compression is very good as well, so there is no need for prefix 
optimization.


All the best,
Sebastian

Am 03.08.2013 23:22, schrieb Markus Krötzsch:
Update: the first bugs in the export have already been discovered -- 
and fixed in the script on github. The files I uploaded will be 
updated on Monday when I have a better upload again (the links file 
should be fine, the statements file requires a rather tolerant Turtle 
string literal parser, and the labels file has a malformed line that 
will hardly work anywhere).


Markus

On 03/08/13 14:48, Markus Krötzsch wrote:

Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). The file creation takes a few (about three) hours on my
machine depending on what exactly is exported.

For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase ontology
that is used in the exports.

The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.). Details
might still change, of course. We might provide regular dumps at another
location once the format is stable.

As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.

Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media data is
correctly exported as URLs (not as strings).

(2) One-pass processing. Dumps are processed only once, even though this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing information.
This is only relevant when exporting statements.

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)

(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.

(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include statement
ranks and timezones for time datavalues. I also currently exclude labels
and descriptions for simple English, formal German, and informal Dutch,
since these would pollute the label space for English, German, and Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).

Feedback is welcome.

Cheers,

Markus

[1] https://github.com/mkroetzsch/wda
 Run "python wda-export.data.py -

Re: [Wikidata-l] Wikidata RDF export available

2013-08-04 Thread Federico Leva (Nemo)

Markus Krötzsch, 04/08/2013 17:35:

Are you sure? The file you linked has mappings from site ids to language
codes, not from language codes to language codes. Do you mean to say:
"If you take only the entries of the form 'XXXwiki' in the list, and
extract a language code from the XXX, then you get a mapping from
language codes to language codes that covers all exceptions in
Wikidata"?


Yes. You said Wikidata just uses the subdomain and the subdomain is 
contained in the database names used by the config. Sorry, I left implicit 
the removal of the wik* suffix and the conversion from _ to -.



This approach would give us:

'als' : 'gsw',
'bat-smg': 'sgs',
'be_x_old' : 'be-tarask',
'crh': 'crh-latn',
'fiu_vro': 'vro',
'no' : 'nb',
'roa-rup': 'rup',
'zh-classical' : 'lzh',
'zh-min-nan': 'nan',
'zh-yue': 'yue'

Each of the values on the left here also occur as language tags in
Wikidata, so if we map them, we use the same tag for things that
Wikidata has distinct tags for. For example, Q27 has a label for yue but
also for zh-yue [1]. It seems to be wrong to export both of these with
the same language tag if Wikidata uses them for different purposes.

Maybe this is a bug in Wikidata and we should just not export texts with
any of the above codes at all (since they always are given by another
tag directly)?


Sorry, I don't know why both can appear. I would have said that one is a 
sitelink and the other some value added on wiki with the correct 
language code (entry label?) but my limited json reading skills seem to 
indicate otherwise.



[...]

Well, the obvious: if a language used in Wikidata labels or on Wikimedia
sites has an official IANA code [2],


(And all of them are supposed to, except rare exceptions with pre-2006 
wikis.)



then we should use this code. Every
other code would be "wrong". For languages that do not have any accurate
code, we should probably use a private code, following the requirements
of BCP 47 for private use subtags (in particular, they should have a
single x somewhere).

This does not seem to be done correctly by my current code. For example,
we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are
IANA language tags, I am not sure that their combination makes sense.
The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and
it is a language code, not a dialect code). Note that map-bms does not
occur in the file you linked to, so I guess there is some more work to do.


Indeed, that appears to be one of the exceptions. :) I don't know how it 
should be tracked; you could file a bug in 
MediaWiki>Internationalisation asking to find a proper code for this 
language.
What was unclear to me is why you implied there were many such cases; 
that would surprise me.


Nemo



Re: [Wikidata-l] Wikidata RDF export available

2013-08-04 Thread Markus Krötzsch

On 04/08/13 13:17, Federico Leva (Nemo) wrote:

Markus Krötzsch, 04/08/2013 12:32:

* Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a
mapping somewhere. Where?


Where I linked it.


Are you sure? The file you linked has mappings from site ids to language 
codes, not from language codes to language codes. Do you mean to say: 
"If you take only the entries of the form 'XXXwiki' in the list, and 
extract a language code from the XXX, then you get a mapping from 
language codes to language codes that covers all exceptions in 
Wikidata"? This approach would give us:


'als' : 'gsw',
'bat-smg': 'sgs',
'be_x_old' : 'be-tarask',
'crh': 'crh-latn',
'fiu_vro': 'vro',
'no' : 'nb',
'roa-rup': 'rup',
'zh-classical' : 'lzh',
'zh-min-nan': 'nan',
'zh-yue': 'yue'

Each of the values on the left here also occur as language tags in 
Wikidata, so if we map them, we use the same tag for things that 
Wikidata has distinct tags for. For example, Q27 has a label for yue but 
also for zh-yue [1]. It seems to be wrong to export both of these with 
the same language tag if Wikidata uses them for different purposes.


Maybe this is a bug in Wikidata and we should just not export texts with 
any of the above codes at all (since they always are given by another 
tag directly)?





* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes
provides some mappings. For example, it maps "zh-yue" to "yue". Yet,
Wikidata use both of these codes. What does this mean?

Answers to Nemo's points inline:

On 04/08/13 06:15, Federico Leva (Nemo) wrote:

Markus Krötzsch, 03/08/2013 15:48:


...


Apart from the above, doesn't wgLanguageCode in
https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
have what you need?


Interesting. However, the list there does not contain all 300 sites that
we currently find in Wikidata dumps (and some that we do not find there,
including things like dkwiki that seem to be outdated). The full list of
sites we support is also found in the file I mentioned above, just after
the language list (variable siteLanguageCodes).


Of course not all wikis are there, that configuration is needed only
when the subdomain is "wrong". It's still not clear to me what codes you
are considering wrong.


Well, the obvious: if a language used in Wikidata labels or on Wikimedia 
sites has an official IANA code [2], then we should use this code. Every 
other code would be "wrong". For languages that do not have any accurate 
code, we should probably use a private code, following the requirements 
of BCP 47 for private use subtags (in particular, they should have a 
single x somewhere).


This does not seem to be done correctly by my current code. For example, 
we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are 
IANA language tags, I am not sure that their combination makes sense. 
The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and 
it is a language code, not a dialect code). Note that map-bms does not 
occur in the file you linked to, so I guess there is some more work to do.


Markus

[1] http://www.wikidata.org/wiki/Special:Export/Q27
[2] 
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry






Re: [Wikidata-l] Wikidata RDF export available

2013-08-04 Thread Federico Leva (Nemo)

Markus Krötzsch, 04/08/2013 12:32:

* Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a
mapping somewhere. Where?


Where I linked it.


* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes
provides some mappings. For example, it maps "zh-yue" to "yue". Yet,
Wikidata use both of these codes. What does this mean?

Answers to Nemo's points inline:

On 04/08/13 06:15, Federico Leva (Nemo) wrote:

Markus Krötzsch, 03/08/2013 15:48:

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)


These are only a handful, aren't they?


There are about 369 language codes right now. You can see the complete
list in langCodes at the bottom of the file

https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py


Most might be correct already, but it is hard to say.


Only a handful are incorrect, unless Wikidata has specific problems (no 
idea how you reach 369).



Also, is it okay
to create new (sub)language codes for our own purposes? Something like
simple English will hardly have an official code, but it would be bad to
export it as "en".




(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.


Apart from the above, doesn't wgLanguageCode in
https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
have what you need?


Interesting. However, the list there does not contain all 300 sites that
we currently find in Wikidata dumps (and some that we do not find there,
including things like dkwiki that seem to be outdated). The full list of
sites we support is also found in the file I mentioned above, just after
the language list (variable siteLanguageCodes).


Of course not all wikis are there, that configuration is needed only 
when the subdomain is "wrong". It's still not clear to me what codes you 
are considering wrong.


Nemo



Re: [Wikidata-l] Wikidata RDF export available

2013-08-04 Thread Markus Krötzsch

Let me top-post a question to the Wikidata dev team:

Where can we find documentation on what the Wikidata internal language 
codes actually mean? In particular, how do you map the language selector 
to the internal codes? I noticed some puzzling details:


* Wikidata uses "be-x-old" as a code, but MediaWiki messages for this 
language seem to use "be-tarask" as a language code. So there must be a 
mapping somewhere. Where?


* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes 
provides some mappings. For example, it maps "zh-yue" to "yue". Yet, 
Wikidata use both of these codes. What does this mean?


Answers to Nemo's points inline:

On 04/08/13 06:15, Federico Leva (Nemo) wrote:

Markus Krötzsch, 03/08/2013 15:48:

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)


These are only a handful, aren't they?


There are about 369 language codes right now. You can see the complete 
list in langCodes at the bottom of the file


https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py

Most might be correct already, but it is hard to say. Also, is it okay 
to create new (sub)language codes for our own purposes? Something like 
simple English will hardly have an official code, but it would be bad to 
export it as "en".





(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.


Apart from the above, doesn't wgLanguageCode in
https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
have what you need?


Interesting. However, the list there does not contain all 300 sites that 
we currently find in Wikidata dumps (and some that we do not find there, 
including things like dkwiki that seem to be outdated). The full list of 
sites we support is also found in the file I mentioned above, just after 
the language list (variable siteLanguageCodes).


Markus




Re: [Wikidata-l] Wikidata RDF export available

2013-08-03 Thread Federico Leva (Nemo)

Markus Krötzsch, 03/08/2013 15:48:

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)


These are only a handful, aren't they?


(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.


Apart from the above, doesn't wgLanguageCode in 
https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php 
have what you need?


Nemo



Re: [Wikidata-l] Wikidata RDF export available

2013-08-03 Thread Markus Krötzsch
Update: the first bugs in the export have already been discovered -- and 
fixed in the script on github. The files I uploaded will be updated on 
Monday when I have a better upload again (the links file should be fine, 
the statements file requires a rather tolerant Turtle string literal 
parser, and the labels file has a malformed line that will hardly work 
anywhere).


Markus

On 03/08/13 14:48, Markus Krötzsch wrote:

Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). The file creation takes a few (about three) hours on my
machine depending on what exactly is exported.

For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase ontology
that is used in the exports.

The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.). Details
might still change, of course. We might provide regular dumps at another
location once the format is stable.

As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.

Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media data is
correctly exported as URLs (not as strings).

(2) One-pass processing. Dumps are processed only once, even though this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing information.
This is only relevant when exporting statements.

(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)

(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.

(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include statement
ranks and timezones for time datavalues. I also currently exclude labels
and descriptions for simple English, formal German, and informal Dutch,
since these would pollute the label space for English, German, and Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).

Feedback is welcome.

Cheers,

Markus

[1] https://github.com/mkroetzsch/wda
 Run "python wda-export.data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF






[Wikidata-l] Wikidata RDF export available

2013-08-03 Thread Markus Krötzsch

Hi,

I am happy to report that an initial, yet fully functional RDF export 
for Wikidata is now available. The exports can be created using the 
wda-export-data.py script of the wda toolkit [1]. This script downloads 
recent Wikidata database dumps and processes them to create RDF/Turtle 
files. Various options are available to customize the output (e.g., to 
export statements but not references, or to export only texts in English 
and Wolof). The file creation takes a few (about three) hours on my 
machine depending on what exactly is exported.


For your convenience, I have created some example exports based on 
yesterday's dumps. These can be found at [2]. There are three Turtle 
files: site links only, labels/descriptions/aliases only, statements 
only. The fourth file is a preliminary version of the Wikibase ontology 
that is used in the exports.


The export format is based on our earlier proposal [3], but it adds a 
lot of details that had not been specified there yet (namespaces, 
references, ID generation, compound datavalue encoding, etc.). Details 
might still change, of course. We might provide regular dumps at another 
location once the format is stable.


As a side effect of these activities, the wda toolkit [1] is also 
getting more convenient to use. Creating code for exporting the data 
into other formats is quite easy.


Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media data is 
correctly exported as URLs (not as strings).


(2) One-pass processing. Dumps are processed only once, even though this 
means that we may not know the types of all properties when we first 
need them: the script queries wikidata.org to find missing information. 
This is only relevant when exporting statements.
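
(As a sketch of the kind of lookup meant here, not the wda code itself: a 
property's datatype can be fetched from wikidata.org via the wbgetentities 
API; P569 is "date of birth".)

import json
import urllib.request

def property_datatype(pid):
    url = ("https://www.wikidata.org/w/api.php"
           "?action=wbgetentities&ids={}&props=datatype&format=json".format(pid))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["entities"][pid]["datatype"]

print(property_datatype("P569"))   # e.g. "time"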


(3) Limited language support. The script uses Wikidata's internal 
language codes for string literals in RDF. In some cases, this might not 
be correct. It would be great if somebody could create a mapping from 
Wikidata language codes to BCP47 language codes (let me know if you 
think you can do this, and I'll tell you where to put it)


(4) Limited site language support. To specify the language of linked 
wiki sites, the script extracts a language code from the URL of the 
site. Again, this might not be correct in all cases, and it would be 
great if somebody had a proper mapping from Wikipedias/Wikivoyages to 
language codes.


(5) Some data excluded. Data that cannot currently be edited is not 
exported, even if it is found in the dumps. Examples include statement 
ranks and timezones for time datavalues. I also currently exclude labels 
and descriptions for simple English, formal German, and informal Dutch, 
since these would pollute the label space for English, German, and Dutch 
without adding much benefit (other than possibly for simple English 
descriptions, I cannot see any case where these languages should ever 
have different Wikidata texts at all).


Feedback is welcome.

Cheers,

Markus

[1] https://github.com/mkroetzsch/wda
Run "python wda-export.data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF

--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529   http://korrekt.org/

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l