Re: [Wikidata-l] [Wikimedia-l] Meeting about the support of Wiktionary in Wikidata
[Sorry for cross-posting]

Yes, I agree that the OmegaWiki community should be involved in the discussions, and I have pointed GerardM to our proposals and discussions, using him as a liaison. We also looked, and keep looking, at the OmegaWiki data model to see what we are missing.

Our latest proposal is different from OmegaWiki in two major points:

* Our primary goal is to provide support for structured data in the Wiktionaries. We do not plan to be the main resource ourselves, where readers come to look something up; we merely provide structured data that a Wiktionary may or may not use. This parallels the role Wikidata has with regard to Wikipedia. It also highlights the difference between Wikidata and OmegaWiki, since OmegaWiki's goal is to create a dictionary of all words of all languages, including lexical, terminological and ontological information.

* A smaller difference is the data model. Wikidata's latest proposal to support Wiktionary is centered around lexemes, and we do not assume that there is such a thing as a language-independent defined meaning. But no matter what model we end up with, it is important to ensure that the bulk of the data can flow freely between the projects; even though we might disagree on this issue in the modeling, the exchange of data should remain widely possible.

We tried to keep notes on the discussion we had today: http://epl.wikimedia.org/p/WiktionaryAndWikidata

My major take-home messages are:

* the proposal needs more visual elements, especially a mock-up or sketch of how it would look and how it could be used on the Wiktionaries

* there is no generally accepted place for a discussion that involves all Wiktionary projects. Still, my initial decision to have the discussion on the Wikidata wiki was not a good one, and it should and will be moved to Meta.

Having said that, the current proposal for the data model of how to support Wiktionary with Wikidata seems to have garnered a lot of support so far, so this is what I will continue building upon. Further comments are extremely welcome. You can find the proposal here: http://www.wikidata.org/wiki/Wikidata:Wiktionary As said, it will be moved to Meta as soon as the requested mockups and extensions are done.

Cheers, Denny

2013/8/10 Samuel Klein meta...@gmail.com

Hello,

On Fri, Aug 9, 2013 at 6:13 PM, JP Béland lebo.bel...@gmail.com wrote: I agree. We also need to include the Omegawiki community.

Agreed.

On Fri, Aug 9, 2013 at 12:22 PM, Laura Hale la...@fanhistory.com wrote: Why? The question of moving them into the WMF fold was pretty much no, because the project has an overlapping purpose with Wiktionary,

This is not actually the case. There was overwhelming community support for adopting OmegaWiki - at least simply providing hosting. It stalled because the code needed a security and style review, and Kip (the lead developer) was going to put some time into that. The OW editors and devs were very interested in finding a way forward that involved Wikidata and led to a combined project with a single repository of terms, meanings, definitions and translations.

Recap: The page describing the OmegaWiki project satisfies all of the criteria for requesting WMF adoption.
* It is well-defined on Meta: http://meta.wikimedia.org/wiki/Omegawiki

* It describes an interesting idea clearly aligned with expanding the scope of free knowledge

* It is not a 'competing' project to the Wiktionaries; it is an idea that grew out of the Wiktionary community, has been developed for years alongside it, and shares many active contributors and linguaphiles.

* It started an RfC which garnered 85% support for adoption: http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki

Even if the current OW code is not used at all for a future Wiktionary update -- and this idea was proposed and taken seriously by the OW devs -- their community of contributors should be part of discussions about how to solve the Wiktionary problem that they were the first to dedicate themselves to.

Regards, Sam.

___ Wikimedia-l mailing list wikimedi...@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe

-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] [Wikimedia-l] Meeting about the support of Wiktionary in Wikidata
To add a couple of comments to what Denny said: from my experience with Wikisource, reaching out to international, loosely connected communities is already a big challenge on its own.

I would like to invite Wiktionary contributors to take a look at this Individual Engagement Grant project that Aubrey and I are doing for Wikisource, because maybe it would make sense for a group of involved Wiktionarians to start a similar initiative for Wiktionary.

The original application can be found here: http://meta.wikimedia.org/wiki/Grants:IEG/Elaborate_Wikisource_strategic_vision

And the midterm report: http://meta.wikimedia.org/wiki/Grants:IEG/Elaborate_Wikisource_strategic_vision

If anyone from the Wiktionary community wants to step forward, I would be more than happy to share experiences and provide advice.

Cheers, Micru
Re: [Wikidata-l] question about 2 different json formats
On Wed, Aug 7, 2013 at 10:11 PM, Denny Vrandečić denny.vrande...@wikimedia.de wrote:

Hi Anthony,

that's the internal data structure, and this is bound to change without notice. I am sorry if this caused trouble. If this is a common concern, we will start documenting and announcing those changes. It really should only concern the people processing the XML dumps. We would prefer to actually create a more stable output dump of the knowledge - I guess this would be more appreciated (like the RDF dump that Markus has posted about recently).

The call to get the item description should have been https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q1

This should provide you with a more stable answer.

Cheers, Denny

2013/8/1 Huidong Zhang anthonyzh...@google.com

Hi, I noticed that the response from http://www.wikidata.org/w/api.php?action=query&titles=Q1&prop=revisions&rvprop=content&format=xml changed from "entity":"q1" to "entity":["item",1]. Is this change applied to all pages? In the latest wikidata dump (http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-meta-current.xml.bz2), both formats exist at the same time. For example, page Q100 has "entity":["item",100], while page Q10 has "entity":"q10". Is it expected? Will the next dump have the same format? By the way, http://www.wikidata.org/w/api.php?action=query&titles=Q10&prop=revisions&rvprop=content&format=xml returns "entity":["item",10]. About the inconsistency in the dump file, is there any bug entry created for this? (I can create one, if anyone can point me to the proper place to do that.) Thanks.

-- Best wishes, Anthony Zhang (Huidong Zhang)

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

-- Jiang BIAN

This email may be confidential or privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it went to the wrong person. Thanks.

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
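For readers who want to follow Denny's suggestion, here is a minimal sketch of calling wbgetentities from Python using only the standard library; the exact shape of the returned JSON (entities keyed by item ID, labels keyed by language) is an assumption based on the current API output and may change:

    import json
    import urllib.request

    # Ask the API for the stable representation of an item instead of
    # parsing the internal structure found in the XML dumps.
    url = ("https://www.wikidata.org/w/api.php"
           "?action=wbgetentities&format=json&ids=Q1")
    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    entity = data["entities"]["Q1"]
    print(entity.get("labels", {}).get("en", {}).get("value"))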
Re: [Wikidata-l] question about 2 different json formats
On 10-08-2013 10:54, Jiang BIAN wrote: On Wed, Aug 7, 2013 at 10:11 PM, Denny Vrandečić denny.vrande...@wikimedia.de wrote: Hi Anthony, that's the internal data structure, and this is bound to change without notice. I am sorry if this caused trouble. If this is a common concern, we will start documenting and announcing those changes. It really should only concern the people processing the XML dumps.

I am one of the people processing the XML dumps, and I don't think it is a big deal. But I have had to change my parser many times to be able to parse new dumps because of changes in the format (in most cases, but not always, because of new features). I just adapt to the changes without fuss, but if the format were documented I could file bug reports whenever the format deviates from the documentation, which might be helpful to the developers.

(BTW, the time values seem to be OK again, after many syntax errors in the beginning. But the coordinate values have some strange (probably erroneous?) variations: values where the precision and/or globe is given as null, and values where the globe is given as the string "earth" instead of an entity.)

About the inconsistency in the dump file, is there any bug entry created for this? (I can create one, if anyone can point me to the proper place to do that).

Not for my sake. I adapted to the two entity formats in the dumps immediately when the new format started to appear.

Best regards, - Byrial

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
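Byrial's point about defensive parsing can be made concrete with a small sketch; the field names follow the globecoordinate datavalue layout seen in the current dumps, and the fallback choices (Earth as default globe, a coarse default precision) are assumptions for illustration only:

    # Q2 is the item for Earth; dumps sometimes carry the bare string "earth" instead.
    EARTH = "http://www.wikidata.org/entity/Q2"

    def normalize_coordinate(value):
        """Normalize a globecoordinate datavalue dict read from a dump."""
        precision = value.get("precision")
        if precision is None:          # precision is sometimes null
            precision = 1.0            # arbitrary coarse default
        globe = value.get("globe")
        if globe in (None, "earth"):   # null or plain "earth" instead of an entity URI
            globe = EARTH
        return {
            "latitude": value.get("latitude"),
            "longitude": value.get("longitude"),
            "precision": precision,
            "globe": globe,
        }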
Re: [Wikidata-l] Wikidata language codes
The language code no is the metacode for Norwegian, and nowiki was in the beginning used for Norwegian Bokmål, Riksmål and Nynorsk. Nynorsk later split off into nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål; the first is official in Norway and the latter is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal in Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki.

I think all content in Wikidata should use either nn or nb, and all existing content with no as language code should be folded into nb. It would be nice if no could be used as an alias for nb, as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community. The site code should be nowiki as long as the community does not ask for a change.

jeblad

On 8/6/13, Markus Krötzsch mar...@semantic-mediawiki.org wrote:

Hi Purodha, thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).

One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)?

The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.

Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both no and nb. Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?).

Cheers, Markus

On 05/08/13 15:41, P. Blissenbach wrote:

Hi Markus, Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl', likewise our code 'sr-el' is currently effectively equivalent to 'sr-Latn'. Both might change, once dialect codes of Serbian are added to the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Our code 'nrm' is not being used for the Narom language as ISO 639-3 does, see: http://www-01.sil.org/iso639-3/documentation.asp?id=nrm We rather use it for Norman / Nourmaud, as described in http://en.wikipedia.org/wiki/Norman_language The Norman language is recognized by the Linguist List and many others but is as of yet not present in ISO 639-3. It should probably be suggested to be added. We should probably map it to a private code meanwhile.

Our code 'ksh' is currently being used to represent a superset of what it stands for in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the only Ripuarian variety (of dozens) having a code to represent the whole lot. We should probably suggest adding a group code to ISO 639, and at least the dozen+ Ripuarian languages that we are using, and map 'ksh' to a private code for Ripuarian meanwhile.

Note also that for the ALS/GSW and the KSH Wikipedias, page titles are not guaranteed to be in the languages of the Wikipedias. They are often in German instead. Details are to be found in their respective page titling rules. Moreover, for the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias, texts are not, or quite often incorrectly, labelled as belonging to a certain dialect. See also: http://meta.wikimedia.org/wiki/Special_language_codes

Greetings -- Purodha

*Sent:* Sunday, 4 August 2013, 19:01 *From:* Markus Krötzsch mar...@semantic-mediawiki.org *To:* Federico Leva (Nemo) nemow...@gmail.com *Cc:* Discussion list for the Wikidata project. wikidata-l@lists.wikimedia.org *Subject:* [Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available)

Small update: I went through the language list at https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472 and added a number of TODOs to the most obvious problematic cases. Typical problems are: *
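To make the kind of code folding discussed in this thread concrete, here is a small illustrative sketch of a normalization table in Python; the target codes are taken from the statements above (no folded into nb as John suggests, the Serbian variants as Purodha describes, kk-Cyrl treated as equivalent to kk per Markus's IANA remark), and any code not listed is passed through unchanged:

    # Illustrative mapping from Wikidata-internal language codes to canonical codes.
    LANGUAGE_CODE_MAP = {
        "no": "nb",          # fold generic Norwegian into Bokmål (suggested above)
        "sr-ec": "sr-Cyrl",  # currently effectively equivalent
        "sr-el": "sr-Latn",  # currently effectively equivalent
        "kk-Cyrl": "kk",     # the same language according to the IANA registry
    }

    def canonical_code(code):
        """Return the canonical code for a Wikidata-internal language code."""
        return LANGUAGE_CODE_MAP.get(code, code)

    assert canonical_code("no") == "nb"
    assert canonical_code("nn") == "nn"  # unknown codes pass through unchanged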
Re: [Wikidata-l] question about 2 different json formats
On 10/08/13 10:29, Byrial Jensen wrote: ... (BTW, the time values seem to be OK again, after many syntax errors in the beginning. But the coordinate values have some strange (probably erroneous?) variations: values where the precision and/or globe is given as null, and values where the globe is given as the string "earth" instead of an entity).

Thanks for the warning. This was something that has been causing problems in the RDF dump too. I am now validating the globe settings more carefully.

Cheers, Markus

About the inconsistency in the dump file, is there any bug entry created for this? (I can create one, if anyone can point me to the proper place to do that). Not for my sake. I adapted to the two entity formats in the dumps immediately when the new format started to appear. Best regards, - Byrial

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata RDF export available
Good morning. I just found a bug that was caused by a bug in the Wikidata dumps (a value that should be a URI was not). This led to a few dozen lines with illegal qnames of the form "w: ". The updated script fixes this.

Cheers, Markus

On 09/08/13 18:15, Markus Krötzsch wrote:

Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote: Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump.

You mean just as at around 15:30 today ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development). I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try.

I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem: Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as post-docs) implement simple serializers for RDF. They all failed. This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to: 1. use a Python RDF framework 2. do some syntax tests on the output, e.g. with rapper 3. use a line-by-line format, e.g. use Turtle without prefixes and just one triple per line (it's like NTriples, but with Unicode)

Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples into an RDF store using a custom script), but I know that it can require some tweaking.

We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra angle bracket somewhere (not confirmed) and we are still searching for it, since the dump is so big

Ok, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of '<' and '>' (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched bracket in the current dumps. Then I used grep to check that '<' and '>' only occur in the statements files in lines with Commons URLs. These are created using urllib, so there should never be any stray '<' or '>' in them.

so we cannot provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization.

Not sure what you mean here. Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and override the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that).

Best wishes, Markus

All the best, Sebastian

On 03.08.2013 23:22, Markus Krötzsch wrote: Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere). Markus

On 03/08/13 14:48, Markus Krötzsch wrote: Hi, I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file
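The grep-based sanity check Markus describes can also be scripted; below is a rough Python equivalent that streams a (possibly bz2-compressed) Turtle dump and compares the counts of '<' and '>'. The file-name handling is an assumption, and the check is only a heuristic, since legitimate string literals could in principle contain either character:

    import bz2
    import sys

    def bracket_counts(path):
        """Count '<' and '>' in a Turtle dump; URIs written as <...> should balance."""
        opener = bz2.open if path.endswith(".bz2") else open
        opens = closes = 0
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                opens += line.count("<")
                closes += line.count(">")
        return opens, closes

    if __name__ == "__main__":
        o, c = bracket_counts(sys.argv[1])
        print("'<':", o, "'>':", c, "balanced" if o == c else "UNBALANCED")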
Re: [Wikidata-l] Wikidata RDF export available
Hi Markus!

Thank you very much. Regarding your last email: of course, I am aware of your arguments that the dump is not official. Nevertheless, I am expecting you and others to code (or supervise) similar RDF dumping projects in the future. Here are two really important things to consider:

1. Always use a mature RDF framework for serializing: Even DBpedia was publishing RDF for years that had some errors in it; this was really frustrating for maintainers (handling bug reports) and clients (trying to quick-fix it). Other small projects (in fact exactly the same as yours, Markus: a guy publishing some useful software) went the same way: lots of small syntax bugs, many bug reports, lots of additional work. Some of them were abandoned because the developer didn't have time anymore.

2. Use NTriples or one-triple-per-line Turtle (Turtle supports IRIs and Unicode; compare):

curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head

curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | bzcat | head

One-triple-per-line lets you a) find errors more easily and b) aids further processing, e.g. calculating the outdegree of subjects:

curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' | awk '{count[$1]++} END {for (j in count) print j "\t" count[j]}'

Furthermore:

- Parsers can treat one-triple-per-line more robustly, by just skipping bad lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the files into several smaller files easily

Blank nodes have some bad properties:

- some databases react weirdly to them, and they sometimes fill up indexes and make the DB slow (depends on the implementation of course; this is just my experience)
- they make splitting one-triple-per-line more difficult
- they are difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or http://web.ing.puc.cl/~marenas/publications/iswc11.pdf

Turtle prefixes: Why do you think they are a good thing? They are sometimes disputed as a premature feature. They do make data more readable, but nobody is going to read 4.4 GB of Turtle. By the way, you can always convert it to Turtle easily:

curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file

All the best, Sebastian
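Sebastian's outdegree one-liner translates directly into a few lines of Python over a one-triple-per-line file; this sketch assumes an N-Triples-style input where the subject is the first whitespace-separated token on each line, and the file name is simply the DBpedia example mentioned above used as a placeholder:

    import bz2
    from collections import Counter

    def subject_outdegree(path):
        """Count how many triples each subject has in a one-triple-per-line dump."""
        counts = Counter()
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank and comment lines
                subject = line.split(None, 1)[0]
                counts[subject.strip("<>")] += 1
        return counts

    for subject, n in subject_outdegree("mappingbased_properties_ko.nt.bz2").most_common(10):
        print(n, subject)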
[Wikidata-l] Wikidata slides on Wikimania2013
Hi, Is there a place where I can find the slides used at this Wikimania? How about linking them on the submission page, e.g. State of Wikidata: http://wikimania2013.wikimedia.org/wiki/Submissions/State_of_Wikidata . Thanks -- Jiang BIAN This email may be confidential or privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it went to the wrong person. Thanks. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata slides on Wikimania2013
Jiang BIAN, 10/08/2013 18:48: Hi, Is there a place that I can find the slides used on this Wikimania? They have to go here: https://commons.wikimedia.org/wiki/Category:Wikimania_2013_presentation_slides Poke by email the presenters who didn't upload them... How about link them on the submission page, e.g. State of Wikidata http://wikimania2013.wikimedia.org/wiki/Submissions/State_of_Wikidata. Yes, please link or transclude on their wiki page all those you can find. Nemo ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata language codes
On 10/08/13 11:07, John Erling Blad wrote:

The language code no is the metacode for Norwegian, and nowiki was in the beginning used for Norwegian Bokmål, Riksmål and Nynorsk. Nynorsk later split off into nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål; the first is official in Norway and the latter is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal in Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki. I think all content in Wikidata should use either nn or nb, and all existing content with no as language code should be folded into nb. It would be nice if no could be used as an alias for nb, as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community. The site code should be nowiki as long as the community does not ask for a change.

Thanks for the clarification. I will keep no to mean no for now. What I wonder is: if users choose to enter a no label on Wikidata, what is the language setting that they see? Does this say Norwegian (any variant) or what? That's what puzzles me. I know that a Wikipedia can allow multiple languages (or dialects) to coexist, but in the Wikidata language selector I thought you can only select real languages, not language groups.

Markus
Re: [Wikidata-l] Scope of a Wikidata entry
Yes, I think multiple identities attached to a single Wikidata entity is the way to go forward. We talked about this briefly at Wikimania on Friday and the consensus was still a bit unclear ;-) Once qualifiers are properly up and running we might be able to mark them as preferred or main relation vs. secondary identifiers (the main VIAF cluster vs the isolated entries, for example).

A.

On Sunday, 11 August 2013, Luca Martinelli wrote: 2013/7/31 Andrew Gray andrew.g...@dunelm.org.uk: Hi Nicholas, a) Yes, it is about the person and the aliases together. As a general rule, it's one article per person, not per name. b) Different names is a quirk of the Wikipedia background - these default to the title of the Wikipedia article on that person, and there's no agreement on whether to put the article under the person or the more famous pseudonym.

FYI, there is now a property for pseudonyms ( http://www.wikidata.org/wiki/Property:P742 ).

d) I think the initial assumption was that there was a 1=1 match, but if there are multiple MusicBrainz IDs representing facets of the same entity, then Wikidata will support adding several.

It is possible to put several IDs coming from the same database. Actually, I'm trying to do this with multiple VIAF codes referring to the same author, and it could also become a feedback to the original database.

-- Luca Sannita Martinelli http://it.wikipedia.org/wiki/Utente:Sannita ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
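As a purely illustrative sketch of the data shape being discussed, two VIAF identifiers on one item might look roughly like this once preferred vs. normal ranks can be used; the layout loosely follows the Wikibase claim structure (mainsnak plus rank), P214 is the VIAF identifier property, and the item ID and VIAF values are placeholders, not real data:

    # Hypothetical item carrying two VIAF identifiers (P214), one marked preferred.
    item = {
        "id": "Q0",  # placeholder item ID
        "claims": {
            "P214": [
                {
                    "mainsnak": {"snaktype": "value", "property": "P214",
                                 "datavalue": {"type": "string", "value": "000000001"}},
                    "rank": "preferred",  # the main VIAF cluster
                },
                {
                    "mainsnak": {"snaktype": "value", "property": "P214",
                                 "datavalue": {"type": "string", "value": "000000002"}},
                    "rank": "normal",     # an isolated secondary entry
                },
            ]
        },
    }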