Re: [Wikidata-l] [Wikimedia-l] Meeting about the support of Wiktionary in Wikidata

2013-08-10 Thread Denny Vrandečić
[Sorry for cross-posting]

Yes, I agree that the OmegaWiki community should be involved in the
discussions, and I pointed GerardM to our proposals and discussions
whenever possible, using him as a liaison. We also looked, and keep looking, at the
OmegaWiki data model to see what we are missing.

Our latest proposal is different from OmegaWiki in two major points:

* our primary goal is to provide support for structured data in the
Wiktionaries. We do not plan to be the main resource ourselves that
readers come to in order to look up something; we merely provide structured
data that a Wiktionary may or may not use. This parallels the role
Wikidata has with regard to Wikipedia. This also highlights the difference
between Wikidata and OmegaWiki, since OmegaWiki's goal is to create a
dictionary of all words of all languages, including lexical, terminological
and ontological information.

* a smaller difference is the data model. Wikidata's latest proposal to
support Wiktionary is centered around lexemes, and we do not assume that
there is such a thing as a language-independent defined meaning. But no
matter which model we end up with, it is important to ensure that the bulk
of the data can flow freely between the projects; even though we
might disagree on this point of the modeling, we want to make sure that the
exchange of data remains widely possible.

We tried to keep notes on the discussion we had today: 
http://epl.wikimedia.org/p/WiktionaryAndWikidata

My major take-home messages are:
* the proposal needs more visual elements, especially a mock-up or sketch
of what it would look like and how it could be used on the Wiktionaries
* there is no generally accepted place for a discussion that involves all
Wiktionary projects. Still, my initial decision to have the discussion on
the Wikidata wiki was not a good one, and it should and will be moved to
Meta.

Having said that, the current proposal for the data model of how to support
Wiktionary with Wikidata seems to have garnered a lot of support so far. So
this is what I will continue building upon. Further comments are extremely
welcome. You can find it here:

http://www.wikidata.org/wiki/Wikidata:Wiktionary

As mentioned, it will be moved to Meta as soon as the requested mock-ups and
extensions are done.

Cheers,
Denny





2013/8/10 Samuel Klein meta...@gmail.com

 Hello,

  On Fri, Aug 9, 2013 at 6:13 PM, JP Béland lebo.bel...@gmail.com wrote:
  I agree. We also need to include the Omegawiki community.

 Agreed.

 On Fri, Aug 9, 2013 at 12:22 PM, Laura Hale la...@fanhistory.com wrote:
  Why? The question of moving them into the WMF fold was pretty much no,
  because the project has an overlapping purpose with Wiktionary,

 This is not actually the case.
 There was overwhelming community support for adopting Omegawiki - at
 least simply providing hosting.  It stalled because the code needed a
 security and style review, and Kip (the lead developer) was going to
 put some time into that.  The OW editors and dev were very interested
 in finding a way forward that involved Wikidata and led to a combined
 project with a single repository of terms, meanings, definitions and
 translations.

 Recap: The page describing the OmegaWiki project satisfies all of the
 criteria for requesting WMF adoption.
 * It is well-defined on Meta http://meta.wikimedia.org/wiki/Omegawiki
 * It describes an interesting idea clearly aligned with expanding the
 scope of free knowledge
 * It is not a 'competing' project to Wiktionaries; it is an idea that
 grew out of the Wiktionary community, has been developed for years
 alongside it, and shares many active contributors and linguaphiles.
 * It started an RfC which garnered 85% support for adoption.
 http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki

 Even if the current OW code is not used at all for a future Wiktionary
 update -- and this idea was proposed and taken seriously by the OW
 devs -- their community of contributors should be part of discussions
 about how to solve the Wiktionary problem that they were the first to
 dedicate themselves to.

 Regards,
 Sam.

 ___
 Wikimedia-l mailing list
 wikimedi...@lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe




-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] [Wikimedia-l] Meeting about the support of Wiktionary in Wikidata

2013-08-10 Thread David Cuenca
To add a couple of comments to what Denny said: from my experience with
Wikisource, reaching out to international, loosely connected communities is
already a big challenge on its own. I would like to invite Wiktionary
contributors to take a look at this Individual Engagement Grant project
that Aubrey and I are doing for Wikisource, because it might make sense
for a group of involved Wiktionarians to start a similar initiative
for Wiktionary. The original application can be found here:
http://meta.wikimedia.org/wiki/Grants:IEG/Elaborate_Wikisource_strategic_vision

And the midterm report:
http://meta.wikimedia.org/wiki/Grants:IEG/Elaborate_Wikisource_strategic_vision

If anyone from the Wiktionary community wants to step forward, I would be
more than happy to share experiences and provide advice.

Cheers,
Micru

On Sat, Aug 10, 2013 at 3:30 AM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 [Sorry for cross-posting]

 Yes, I agree that the OmegaWiki community should be involved in the
 discussions, and I pointed GerardM to our proposals and discussions
 whenever possible, using him as a liaison. We also looked, and keep looking, at the
 OmegaWiki data model to see what we are missing.

 Our latest proposal is different from OmegaWiki in two major points:

 * our primary goal is to provide support for structured data in the
 Wiktionaries. We do not plan to be the main resource ourselves that
 readers come to in order to look up something; we merely provide structured
 data that a Wiktionary may or may not use. This parallels the role
 Wikidata has with regard to Wikipedia. This also highlights the difference
 between Wikidata and OmegaWiki, since OmegaWiki's goal is to create a
 dictionary of all words of all languages, including lexical, terminological
 and ontological information.

 * a smaller difference is the data model. Wikidata's latest proposal to
 support Wiktionary is centered around lexemes, and we do not assume that
 there is such a thing as a language-independent defined meaning. But no
 matter which model we end up with, it is important to ensure that the bulk
 of the data can flow freely between the projects; even though we
 might disagree on this point of the modeling, we want to make sure that the
 exchange of data remains widely possible.

 We tried to keep notes on the discussion we had today: 
 http://epl.wikimedia.org/p/WiktionaryAndWikidata

 My major take-home messages are:
 * the proposal needs more visual elements, especially a mock-up or sketch
 of what it would look like and how it could be used on the Wiktionaries
 * there is no generally accepted place for a discussion that involves all
 Wiktionary projects. Still, my initial decision to have the discussion on
 the Wikidata wiki was not a good one, and it should and will be moved to
 Meta.

 Having said that, the current proposal for the data model of how to support
 Wiktionary with Wikidata seems to have garnered a lot of support so far. So
 this is what I will continue building upon. Further comments are extremely
 welcome. You can find it here:

 http://www.wikidata.org/wiki/Wikidata:Wiktionary

 As mentioned, it will be moved to Meta as soon as the requested mock-ups and
 extensions are done.

 Cheers,
 Denny





 2013/8/10 Samuel Klein meta...@gmail.com

  Hello,
 
   On Fri, Aug 9, 2013 at 6:13 PM, JP Béland lebo.bel...@gmail.com
 wrote:
   I agree. We also need to include the Omegawiki community.
 
  Agreed.
 
  On Fri, Aug 9, 2013 at 12:22 PM, Laura Hale la...@fanhistory.com
 wrote:
   Why? The question of moving them into the WMF fold was pretty much no,
   because the project has an overlapping purpose with Wiktionary,
 
  This is not actually the case.
  There was overwhelming community support for adopting Omegawiki - at
  least simply providing hosting.  It stalled because the code needed a
  security and style review, and Kip (the lead developer) was going to
  put some time into that.  The OW editors and dev were very interested
  in finding a way forward that involved Wikidata and led to a combined
  project with a single repository of terms, meanings, definitions and
  translations.
 
  Recap: The page describing the OmegaWiki project satisfies all of the
  criteria for requesting WMF adoption.
  * It is well-defined on Meta http://meta.wikimedia.org/wiki/Omegawiki
  * It describes an interesting idea clearly aligned with expanding the
  scope of free knowledge
  * It is not a 'competing' project to Wiktionaries; it is an idea that
  grew out of the Wiktionary community, has been developed for years
  alongside it, and shares many active contributors and linguaphiles.
  * It started an RfC which garnered 85% support for adoption.
  http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
 
  Even if the current OW code is not used at all for a future Wiktionary
  update -- and this idea was proposed and taken seriously by the OW
  devs -- their community of contributors should be part of 

Re: [Wikidata-l] question about 2 different json formats

2013-08-10 Thread Jiang BIAN
On Wed, Aug 7, 2013 at 10:11 PM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 Hi Anthony,

 that's the internal data structure, and this is bound to change without
 notice. I am sorry if this caused trouble.

 If this is a common concern, we will start documenting and announcing
 those changes. It really should only concern the people processing the XML
 dumps.

 We would prefer to actually create a more stable output dump of the
 knowledge - I guess this would be more appreciated (like the RDF dump that
 Markus has posted about recently).

 The call to get the item description should have been

 
 https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q1
 

 This should provide you with a more stable answer.
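 A minimal sketch of making that call from Python (the label lookup at the
 end is only an illustration of the response layout, not an official client):

 import json
 from urllib.request import urlopen

 URL = ("https://www.wikidata.org/w/api.php"
        "?action=wbgetentities&format=json&ids=Q1")

 with urlopen(URL) as response:
     data = json.loads(response.read().decode("utf-8"))

 # Entities are keyed by their ID in the response.
 print(data["entities"]["Q1"]["labels"]["en"]["value"])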

 Cheers,
 Denny




 2013/8/1 Huidong Zhang anthonyzh...@google.com

  Hi,

 I noticed that the response from
 http://www.wikidata.org/w/api.php?action=query&titles=Q1&prop=revisions&rvprop=content&format=xml
 changed from "entity":"q1" to "entity":["item",1].
 Is this change applied to all pages?

 In the latest wikidata dump (
 http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-meta-current.xml.bz2),
 both formats exist at the same time. For example, page Q100 has
 "entity":["item",100], while page Q10 has "entity":"q10". Is that
 expected? Will the next dump have the same format?
 By the way,
 http://www.wikidata.org/w/api.php?action=query&titles=Q10&prop=revisions&rvprop=content&format=xml
 returns "entity":["item",10].
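 Until the dumps settle on a single format, a parser has to accept both
 shapes of the entity field. A minimal sketch in Python (the "property" ->
 "P" case is an assumption; only items appear in the examples above):

 def entity_id(entity_field):
     # Normalize the 'entity' field of a dump page to an ID like "Q1".
     # Old format: the string "q1"; new format: the pair ["item", 1].
     if isinstance(entity_field, str):
         return entity_field.upper()
     entity_type, number = entity_field
     prefix = {"item": "Q", "property": "P"}[entity_type]
     return "%s%d" % (prefix, number)

 # entity_id("q10") -> "Q10";  entity_id(["item", 100]) -> "Q100"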


About the inconsistency in the dump file, is there any bug entry created
for this?
(I can create one, if anyone can point me to the proper place to do that).




 Thanks.

 --
 Best wishes,
 Anthony Zhang (Huidong Zhang)

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l




 --
 Project director Wikidata
 Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
 Tel. +49-30-219 158 26-0 | http://wikimedia.de

 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
 Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
 der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
 Körperschaften I Berlin, Steuernummer 27/681/51985.

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l




-- 
Jiang BIAN

This email may be confidential or privileged.  If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it went to
the wrong person.  Thanks.
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] question about 2 different json formats

2013-08-10 Thread Byrial Jensen

On 10-08-2013 10:54, Jiang BIAN wrote:

On Wed, Aug 7, 2013 at 10:11 PM, Denny Vrandečić
denny.vrande...@wikimedia.de mailto:denny.vrande...@wikimedia.de wrote:

Hi Anthony,

that's the internal data structure, and this is bound to change
without notice. I am sorry if this caused trouble.

If this is a common concern, we will start documenting and
announcing those changes. It really should only concern the people
processing the XML dumps.


I am one of the people processing the XML dumps, and I don't think it is 
a big deal. But I have had to change my parser many times to be able to 
parse new dumps because of changes in the format (in most cases, but not 
always, because of new features).


I just adapt to the changes without fuss, but if the format were 
documented I could file bug reports whenever the format deviates 
from the documentation, which might be helpful to the developers.


(By the way, the time values seem to be OK again, after many syntax errors in 
the beginning. But the coordinate values have some strange (probably 
erroneous?) variations: values where the precision and/or globe is given 
as null, and values where the globe is given as the string "earth" 
instead of an entity.)



About the inconsistency in the dump file, is there any bug entry created
for this?
 (I can create one, if anyone can point me to the proper place to do that).


Not for my sake. I adapted to two entity formats in the dumps 
immediately when the new format started to appear.


Best regards,
- Byrial


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata language codes

2013-08-10 Thread John Erling Blad
The language code "no" is the metacode for Norwegian, and nowiki was
in the beginning used for Norwegian Bokmål, Riksmål and Nynorsk alike.
Nynorsk later split off into nnwiki, but nowiki continued as before.
After a while all Nynorsk content was migrated. Now nowiki has content
in Bokmål and Riksmål; the first is official in Norway and the latter
is an unofficial variant. After the latest additions to Bokmål there are
very few forms that are only legal in Riksmål, so for all practical
purposes nowiki has become a pure Bokmål wiki.

I think all content in Wikidata should use either "nn" or "nb", and
all existing content with "no" as language code should be folded into
"nb". It would be nice if "no" could be used as an alias for "nb", as
this is the de facto situation now, but it is probably not necessary and
could spark a discussion with the Nynorsk community.
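A rough sketch of that folding rule for anyone post-processing the dumps
(what to do when an item already has a conflicting "nb" text is an
assumption; here the existing "nb" text wins):

def fold_norwegian(labels):
    # labels: dict mapping language codes to texts, as found in the dump.
    # Fold "no" into "nb", keeping an existing "nb" text if present.
    if "no" in labels:
        labels.setdefault("nb", labels["no"])
        del labels["no"]
    return labels

# {'no': 'Norge', 'nn': 'Noreg'} -> {'nb': 'Norge', 'nn': 'Noreg'}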

The site code should be nowiki as long as the community does not ask
for a change.

jeblad

On 8/6/13, Markus Krötzsch mar...@semantic-mediawiki.org wrote:
 Hi Purodha,

 thanks for the helpful hints. I have implemented most of these now in
 the list on git (this is also where you can see the private codes I have
 created where needed). I don't see a big problem in changing the codes
 in future exports if better options become available (it's much easier
 than changing codes used internally).

 One open question that I still have is what it means if a language that
 usually has a script tag appears without such a tag (zh vs.
 zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that
 we do not know which script is used under this code (either could appear)?

 The other question is about the duplicate language tags, such as 'crh'
 and 'crh-Latn', which both appear in the data but are mapped to the same
 code. Maybe one of the codes is just phased out and will disappear over
 time? I guess the Wikidata team needs to answer this. We also have some
 codes that mean the same according to IANA, namely kk and kk-Cyrl, but
 which are currently not mapped to the same canonical IANA code.

 Finally, I wondered about Norwegian. I gather that no.wikipedia.org is
 in Norwegian Bokmål (nb), which is how I map the site now. However, the
 language data in the dumps (not the site data) uses both no and nb.
 Moreover, many items have different texts for nb and no. I wonder if
 both are still Bokmål, and there is just a bug that allows people to
 enter texts for nb under two language settings (for descriptions this
 could easily be a different text, even if in the same language). We also
 have nn, and I did not check how this relates to no (same text or
 different?).

 Cheers,
 Markus

 On 05/08/13 15:41, P. Blissenbach wrote:
 Hi Markus,
 Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl',
 likewise
 is our code 'sr-el' currently effectively equivalent to 'sr-Latn'. Both
 might change,
 once dialect codes of Serbian are added to the IANA subtag registry at
 http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
 Our code 'nrm' is not being used for the Narom language as ISO 639-3
 does, see:
 http://www-01.sil.org/iso639-3/documentation.asp?id=nrm
 We rather use it for the Norman / Nourmaud, as described in
 http://en.wikipedia.org/wiki/Norman_language
 The Norman language is recognized by the linguist list and many others
 but as of
 yet not present in ISO 639-3. It should probably be suggested to be added.
 We should probably map it to a private code in the meantime.
 Our code 'ksh' is currently being used to represent a superset of what
 it stands for
 in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the
 code of the
 only Ripuarian variety (of dozens) having a code, to represent the whole
 lot. We
 should probably suggest to add a group code to ISO 639, and at least the
 dozen+
 Ripuarian languages that we are using, and map 'ksh' to a private code
 for Ripuarian
 meanwhile.
 Note also, that for the ALS/GSW and the KSH Wikipedias, page titles are
 not
 guaranteed to be in the languages of the Wikipedias. They are often in
 German
 instead. Details can be found in their respective page titling rules.
 Moreover,
 for the ksh Wikipedia, unlike some other multilingual or multidialectal
 Wikipedias,
 texts are not, or quite often incorrectly, labelled as belonging to a
 certain dialect.
 See also: http://meta.wikimedia.org/wiki/Special_language_codes
 Greetings -- Purodha
 *Gesendet:* Sonntag, 04. August 2013 um 19:01 Uhr
 *Von:* Markus Krötzsch mar...@semantic-mediawiki.org
 *An:* Federico Leva (Nemo) nemow...@gmail.com
 *Cc:* Discussion list for the Wikidata project.
 wikidata-l@lists.wikimedia.org
 *Betreff:* [Wikidata-l] Wikidata language codes (Was: Wikidata RDF
 export available)
 Small update: I went through the language list at

 https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472

 and added a number of TODOs to the most obvious problematic cases.
 Typical problems are:

 * 

Re: [Wikidata-l] question about 2 different json formats

2013-08-10 Thread Markus Krötzsch

On 10/08/13 10:29, Byrial Jensen wrote:
...


(By the way, the time values seem to be OK again, after many syntax errors in
the beginning. But the coordinate values have some strange (probably
erroneous?) variations: values where the precision and/or globe is given
as null, and values where the globe is given as the string "earth"
instead of an entity.)


Thanks for the warning. This was something that has been causing 
problems in the RDF dump too. I am now validating the globe settings 
more carefully.
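For anyone doing similar checks, a rough sketch of such a normalization
(treating "earth" and null as the Earth entity is my own assumption here,
not project policy):

EARTH = "http://www.wikidata.org/entity/Q2"  # assumed URI for the default globe

def normalize_globe(coordinate_value):
    # coordinate_value: the dump's JSON dict with latitude, longitude,
    # precision and globe.
    globe = coordinate_value.get("globe")
    if globe is None or globe == "earth":
        return EARTH
    return globe

def has_usable_precision(coordinate_value):
    # Treat a null precision as erroneous and skip such values.
    return coordinate_value.get("precision") is not None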


Cheers,

Markus




About the inconsistency in the dump file, is there any bug entry created
for this?
(I can create one, if anyone can point me to the proper place to do that).


Not for my sake. I adapted to two entity formats in the dumps
immediately when the new format started to appear.

Best regards,
- Byrial


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l



___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Markus Krötzsch
Good morning. I just found a bug that was caused by a bug in the 
Wikidata dumps (a value that should be a URI was not). This led to a few 
dozen lines with illegal qnames of the form "w: ". The updated script 
fixes this.


Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.



I saw that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement simple serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of
time. They wanted to do it quickly, but they didn't have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with rapper
3. use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)


Yes, URI encoding could be difficult if we were doing it manually. Note,
however, that we are already using a standard library for URI encoding
in all non-trivial cases, so this does not seem to be a very likely
cause of the problem (though some non-zero probability remains). In
general, it is not unlikely that there are bugs in the RDF somewhere;
please consider this export as an early prototype that is meant for
experimentation purposes. If you want an official RDF dump, you will
have to wait for the Wikidata project team to get around to doing it (this
will surely be based on an RDF library). Personally, I already found the
dump useful (I successfully imported some 109 million triples with a
custom script into an RDF store), but I know that it can require some
tweaking.



We are having a problem currently, because we tried to convert the dump
to NTriples (which would be handled by a framework as well) with rapper.
We assume that the error is an extra ">" somewhere (not confirmed), and
we are still searching for it since the dump is so big.


Ok, looking forward to hearing about the results of your search. A good tip
for checking such things is to use grep. I did a quick grep on my
current local statements export to count the numbers of "<" and ">" (this
takes less than a minute on my laptop, including on-the-fly
decompression). Both numbers were equal, making it unlikely that there
is any unmatched "<" or ">" in the current dumps. Then I used grep to check
that "<" and ">" only occur in the statements files in lines with Commons
URLs. These are created using urllib, so there should never be any "<" or
">" in them.
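The same sanity check is easy to script; a small sketch in Python (the file
name is hypothetical):

import bz2

opening = closing = 0
with bz2.open("wikidata-statements.ttl.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        opening += line.count("<")
        closing += line.count(">")

# Equal counts make an unmatched bracket unlikely (though not impossible).
print(opening, closing)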


so we can not provide a detailed bug report. If we had one triple per
line, this would also be easier, plus there are advantages for stream
reading. bzip2 compression is very good as well, no need for prefix
optimization.


Not sure what you mean here. Turtle prefixes in general seem to be a
Good Thing, not just for reducing the file size. The code has no easy
way to get rid of prefixes, but if you want a line-by-line export you
could subclass my exporter and overwrite the methods for incremental
triple writing so that they remember the last subject (or property) and
create full triples instead. This would give you a line-by-line export
in (almost) no time (some uses of [...] blocks in object positions would
remain, but maybe you could live with that).

Best wishes,

Markus



All the best,
Sebastian

Am 03.08.2013 23:22, schrieb Markus Krötzsch:

Update: the first bugs in the export have already been discovered --
and fixed in the script on github. The files I uploaded will be
updated on Monday when I have a better upload again (the links file
should be fine, the statements file requires a rather tolerant Turtle
string literal parser, and the labels file has a malformed line that
will hardly work anywhere).

Markus

On 03/08/13 14:48, Markus Krötzsch wrote:

Hi,

I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in
English
and Wolof). The file 

Re: [Wikidata-l] Wikidata RDF export available

2013-08-10 Thread Sebastian Hellmann

Hi Markus!
Thank you very much.

Regarding your last email:
Of course, I am aware of your arguments in your last email, that the 
dump is not official. Nevertheless, I am expecting you and others to 
code (or supervise) similar RDF dumping projects in the future.


Here are two really important things to consider:

1. Always use a mature RDF framework for serializing:
Even DBpedia published RDF with errors in it for years; this was really 
frustrating for maintainers (handling bug reports) and for clients (trying 
to quick-fix it).
Other small projects (in fact exactly the same situation as yours, Markus: a 
guy publishing some useful software) went the same way: lots of small 
syntax bugs, many bug reports, a lot of additional work. Some of them 
were abandoned because the developer didn't have time anymore.


2. Use NTriples or one-triple-per-line Turtle:
(Turtle supports IRIs and Unicode; compare:)
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | 
bzcat | head


one-triple-per-line lets you
a) find errors more easily, and
b) do further processing, e.g. calculate the out-degree of subjects:
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' | 
awk '{count[$1]++} END {for (j in count) print j "\t" count[j]}'


Furthermore:
- Parsers can treat one-triple-per-line input more robustly, by just skipping bad lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the files in several smaller files easily
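To illustrate points 1 and 2 together, a sketch of round-tripping a Turtle
file through a framework into one-triple-per-line N-Triples, here with
rdflib (file names are hypothetical); anything the parser cannot read then
fails loudly at parse time instead of surfacing later as a bug report:

from rdflib import Graph

g = Graph()
g.parse("wikidata-statements.ttl", format="turtle")   # syntax errors raised here
g.serialize(destination="wikidata-statements.nt", format="nt")  # one triple per line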


Blank nodes have some bad properties:
- some databases react strangely to them, and they sometimes fill up indexes 
and make the DB slow (this depends on the implementation of course; this is 
just my experience)

- make splitting one-triple-per-line more difficult
- difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or 
http://web.ing.puc.cl/~marenas/publications/iswc11.pdf



Turtle prefixes:
Why do you think they are a good thing? They are sometimes disputed as a 
premature feature. They do make data more readable, but nobody is 
going to read 4.4 GB of Turtle.

By the way, you can always convert it to turtle easily:
curl 
http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | 
bzcat | head -100  | rapper -i turtle -o turtle -I - - file


All the best,
Sebastian



Am 10.08.2013 12:44, schrieb Markus Krötzsch:
Good morning. I just found a bug that was caused by a bug in the 
Wikidata dumps (a value that should be a URI was not). This led to a 
few dozen lines with illegal qnames of the form "w: ". The updated 
script fixes this.


Cheers,

Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote:

Hi Markus,
we just had a look at your python code and created a dump. We are still
getting a syntax error for the turtle dump.


You mean "just" as in "at around 15:30 today" ;-)? The code is under
heavy development, so changes are quite frequent. Please expect things
to be broken in some cases (this is just a little community project, not
part of the official Wikidata development).

I have just uploaded a new statements export (20130808) to
http://semanticweb.org/RDF/Wikidata/ which you might want to try.



I saw that you did not use a mature framework for serializing the
turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement simple serializers
for RDF.

They all failed.

This was normally not due to a lack of skill, but due to a lack of
time. They wanted to do it quickly, but they didn't have the time
to implement it correctly in the long run.
There are some really nasty problems ahead, like encoding or special
characters in URIs. I would strongly advise you to:

1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with rapper
3. use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)


Yes, URI encoding could be difficult if we were doing it manually. Note,
however, that we are already using a standard library for URI encoding
in all non-trivial cases, so this does not seem to be a very likely
cause of the problem (though some non-zero probability remains). In
general, it is not unlikely that there are bugs in the RDF somewhere;
please consider this export as an early prototype that is meant for
experimentation purposes. If you want an official RDF dump, you will
have to wait for the Wikidata project team to get around to doing it (this
will surely be based on an RDF library). Personally, I already found the
dump useful (I successfully imported some 109 million triples with a
custom script into an RDF store), but I know that it can require 

[Wikidata-l] Wikidata slides on Wikimania2013

2013-08-10 Thread Jiang BIAN
Hi,

Is there a place where I can find the slides used at this Wikimania? How
about linking them on the submission pages, e.g. State of Wikidata
(http://wikimania2013.wikimedia.org/wiki/Submissions/State_of_Wikidata)?


Thanks


-- 
Jiang BIAN

This email may be confidential or privileged.  If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it went to
the wrong person.  Thanks.
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata slides on Wikimania2013

2013-08-10 Thread Federico Leva (Nemo)

Jiang BIAN, 10/08/2013 18:48:

Hi,

Is there a place where I can find the slides used at this Wikimania?


They have to go here: 
https://commons.wikimedia.org/wiki/Category:Wikimania_2013_presentation_slides

Poke the presenters who didn't upload them by email...


How
about linking them on the submission pages, e.g. State of Wikidata
(http://wikimania2013.wikimedia.org/wiki/Submissions/State_of_Wikidata)?


Yes, please link or transclude on their wiki page all those you can find.

Nemo

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata language codes

2013-08-10 Thread Markus Krötzsch

On 10/08/13 11:07, John Erling Blad wrote:

The language code "no" is the metacode for Norwegian, and nowiki was
in the beginning used for Norwegian Bokmål, Riksmål and Nynorsk alike.
Nynorsk later split off into nnwiki, but nowiki continued as before.
After a while all Nynorsk content was migrated. Now nowiki has content
in Bokmål and Riksmål; the first is official in Norway and the latter
is an unofficial variant. After the latest additions to Bokmål there are
very few forms that are only legal in Riksmål, so for all practical
purposes nowiki has become a pure Bokmål wiki.

I think all content in Wikidata should use either "nn" or "nb", and
all existing content with "no" as language code should be folded into
"nb". It would be nice if "no" could be used as an alias for "nb", as
this is the de facto situation now, but it is probably not necessary and
could spark a discussion with the Nynorsk community.

The site code should be nowiki as long as the community does not ask
for a change.


Thanks for the clarification. I will keep "no" to mean "no" for now.

What I wonder is: if users choose to enter a "no" label on Wikidata, 
what is the language setting that they see? Does this say "Norwegian 
(any variant)" or what? That's what puzzles me. I know that a Wikipedia 
can allow multiple languages (or dialects) to coexist, but in the 
Wikidata language selector I thought you could only select real 
languages, not language groups.


Markus




On 8/6/13, Markus Krötzsch mar...@semantic-mediawiki.org wrote:

Hi Purodha,

thanks for the helpful hints. I have implemented most of these now in
the list on git (this is also where you can see the private codes I have
created where needed). I don't see a big problem in changing the codes
in future exports if better options become available (it's much easier
than changing codes used internally).

One open question that I still have is what it means if a language that
usually has a script tag appears without such a tag (zh vs.
zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that
we do not know which script is used under this code (either could appear)?

The other question is about the duplicate language tags, such as 'crh'
and 'crh-Latn', which both appear in the data but are mapped to the same
code. Maybe one of the codes is just phased out and will disappear over
time? I guess the Wikidata team needs to answer this. We also have some
codes that mean the same according to IANA, namely kk and kk-Cyrl, but
which are currently not mapped to the same canonical IANA code.

Finally, I wondered about Norwegian. I gather that no.wikipedia.org is
in Norwegian Bokmål (nb), which is how I map the site now. However, the
language data in the dumps (not the site data) uses both no and nb.
Moreover, many items have different texts for nb and no. I wonder if
both are still Bokmål, and there is just a bug that allows people to
enter texts for nb under two language settings (for descriptions this
could easily be a different text, even if in the same language). We also
have nn, and I did not check how this relates to no (same text or
different?).

Cheers,
Markus

On 05/08/13 15:41, P. Blissenbach wrote:

Hi Markus,
Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl',
likewise
is our code 'sr-el' currently effectively equivalent to 'sr-Latn'. Both
might change,
once dialect codes of Serbian are added to the IANA subtag registry at
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Our code 'nrm' is not being used for the Narom language as ISO 639-3
does, see:
http://www-01.sil.org/iso639-3/documentation.asp?id=nrm
We rather use it for the Norman / Nourmaud, as described in
http://en.wikipedia.org/wiki/Norman_language
The Norman language is recognized by the linguist list and many others
but as of
yet not present in ISO 639-3. It should probably be suggested to be added.
We should probably map it to a private code in the meantime.
Our code 'ksh' is currently being used to represent a superset of what
it stands for
in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the
code of the
only Ripuarian variety (of dozens) having a code, to represent the whole
lot. We
should probably suggest to add a group code to ISO 639, and at least the
dozen+
Ripuarian languages that we are using, and map 'ksh' to a private code
for Ripuarian
meanwhile.
Note also, that for the ALS/GSW and the KSH Wikipedias, page titles are
not
guaranteed to be in the languages of the Wikipedias. They are often in
German
instead. Details can be found in their respective page titling rules.
Moreover,
for the ksh Wikipedia, unlike some other multilingual or multidialectal
Wikipedias,
texts are not, or quite often incorrectly, labelled as belonging to a
certain dialect.
See also: http://meta.wikimedia.org/wiki/Special_language_codes
Greetings -- Purodha
*Gesendet:* Sonntag, 04. August 2013 um 19:01 Uhr
*Von:* Markus Krötzsch mar...@semantic-mediawiki.org
*An:* Federico Leva (Nemo) 

Re: [Wikidata-l] Scope of a Wikidata entry

2013-08-10 Thread Andrew Gray
Yes, I think multiple identifiers attached to a single Wikidata entity are
the way to go forward. We talked about this briefly at Wikimania on Friday
and the consensus was still a bit unclear ;-)

Once qualifiers are properly up and running, we might be able to mark them
as preferred or main relations vs. secondary identifiers (the main VIAF
cluster vs. the isolated entries, for example).
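For reference, a rough sketch of attaching several identifiers from the
same database to one item with pywikibot (the item and VIAF values below
are placeholders, and P214 is assumed to be the VIAF identifier property):

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q42")            # placeholder item

for viaf_id in ("12345678", "87654321"):          # placeholder VIAF cluster IDs
    claim = pywikibot.Claim(repo, "P214")         # assumed VIAF identifier property
    claim.setTarget(viaf_id)
    item.addClaim(claim, summary="Adding VIAF identifier")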

A.

On Sunday, 11 August 2013, Luca Martinelli wrote:

 2013/7/31 Andrew Gray andrew.g...@dunelm.org.uk:
  Hi Nicholas,
 
  a) Yes, it is about the person and the aliases together. As a general
  rule, it's one article per person, not per name.
 
  b) Different names is a quirk of the Wikipedia background - these
  default to the title of the Wikipedia article on that person, and
  there's no agreement on whether to put the article under the person or
  the more famous pseudonym.

 FYI, there is now a property for pseudonyms
 (http://www.wikidata.org/wiki/Property:P742).

  d) I think the initial assumption was that there was a 1:1 match, but
  if there are multiple MusicBrainz IDs representing facets of the same
  entity, then Wikidata will support adding several.

 It is possible to put several IDs coming from the same database.
 Actually, I'm trying to do this with multiple VIAF codes referring to
 the same author, and it could also serve as feedback to the original
 database.

 --
 Luca "Sannita" Martinelli
 http://it.wikipedia.org/wiki/Utente:Sannita

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l