Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-28 Thread gnosygnu
> If you have problems accessing the datatype from Lua or elsewhere, let me 
> know.

Honestly, I haven't tried. Just so you know, I'm the developer of XOWA
which is an offline wiki app in Java. As such, I'm accessing Wikidata
data directly, not through the Wikibase code. (If you're curious, I
also use it to recreate Wikidata locally. See:
http://xowa.org/home/file/screenshot_wikidata.png)

Going forward, I'll double-check that my Wikidata issues are not
related to my not using Wikibase. Again, my thanks to you for clearing
that up.

> It's always cool to see that people use our data and our software!

Yup. Wikidata is very cool in concept and in practice. It's amazing to
have a single, multi-lingual, verifiable repository of facts / details
-- all free and open-content. Kudos to you and your team for the
excellent work!

On Mon, Nov 28, 2016 at 11:39 AM, Daniel Kinzler wrote:
> On 28.11.2016 at 17:34, gnosygnu wrote:
>>> The datatype is implicit, it can be derived from the property ID. You can 
>>> find
>>> it by looking at the Property page's JSON.
>>> ...
>>
>> Thanks for all the info. I see my error. I didn't realize that
>> mainsnak.datatype was inferred. I assumed it would have to be embedded
>> directly in the XML dump's JSON (partly because it is embedded directly
>> in the JSON dump's JSON).
>>
>> The rest of your points make sense. Thanks again for taking the time to 
>> clarify.
>
> If you have problems accessing the datatype from Lua or elsewhere, let me 
> know.
> There may be issues with the import process.
>
> It's always cool to see that people use our data and our software!
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-28 Thread Daniel Kinzler
On 28.11.2016 at 17:34, gnosygnu wrote:
>> The datatype is implicit, it can be derived from the property ID. You can 
>> find
>> it by looking at the Property page's JSON.
>> ...
> 
> Thanks for all the info. I see my error. I didn't realize that
> mainsnak.datatype was inferred. I assumed it would have to be embedded
> directly in the XML dump's JSON (partly because it is embedded directly
> in the JSON dump's JSON).
> 
> The rest of your points make sense. Thanks again for taking the time to 
> clarify.

If you have problems accessing the datatype from Lua or elsewhere, let me know.
There may be issues with the import process.

It's always cool to see that people use our data and our software!


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-28 Thread Daniel Kinzler
On 28.11.2016 at 16:31, gnosygnu wrote:
>> If you are also using the same software (Wikibase on MediaWiki), the XML 
>> dumps
>> should Just Work (tm). The idea of the XML dumps is that the "text" blobs are
>> opaque to 3rd parties, but will continue to work with future versions of
>> MediaWiki & friends (with a compatible configuration - which is rather 
>> tricky).
> 
> Not sure I follow. Even from a Wikibase on MediaWiki perspective, the
> XML dumps are still incomplete (since they're missing
> mainsnak.datatype).

The datatype is implicit, it can be derived from the property ID. You can find
it by looking at the Property page's JSON.
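
For example, a minimal sketch of that lookup in Python (the
Special:EntityData URL pattern and the top-level "datatype" field are as
I recall the canonical JSON, so double-check against a live request):

  import json
  import urllib.request

  def property_datatype(pid):
      # Fetch the Property page's canonical JSON from Special:EntityData.
      url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % pid
      with urllib.request.urlopen(url) as resp:
          data = json.load(resp)
      # Property entities carry their datatype as a top-level field.
      return data["entities"][pid]["datatype"]

  print(property_datatype("P41"))  # expected: commonsMedia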

The XML dumps are complete by definition, since they contain a raw copy of the
primary data blob. All other data is derived from this. However, since they are
"raw", they are not easy to process by consumers, and we make no guarantees
regarding the raw data format.

We include the data type in the statements of the canonical JSON dumps for
convenience, and we plan to add more such conveniences to the JSON output.
That does not make the XML dumps incomplete.

Your use case is special since you want canonical JSON *and* wikitext. I'm afraid
you will have to process both kinds of dumps.

> One line of the file specifically checks for datatype: "if datatype
> and datatype == 'commonsMedia' then". This line always evaluates to
> false, even though the entity (Q38: Italy) and property (P41: flag
> image) do have the "commonsMedia" datatype (since the XML dump does not
> include "mainsnak.datatype").

That is incorrect. datatype will always be set in Lua, even if it is not present
in the XML. Remember that it is not present in the primary blob on Wikidata
either. Wikibase will look it up internally, from the wb_property_info table,
and make that information available to Lua.

When loading the XML file, a lot of secondary information is extracted into
database tables for this kind of use, e.g. all the labels and descriptions go
into the wb_terms table, property types go into wb_property_info, links to other
items go into pagelinks, etc.

Actually, you may have to run refreshLinks.php or rebuildall.php after doing the
XML import; I'm no longer sure which is needed when. But the point is: the
XML dump contains all information needed to reconstruct the content. This is
true for wikitext as well as for Wikibase JSON data. All derived information is
extracted upon import, and is made available via the respective APIs, including
Lua, just like on Wikidata.
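
If you want to sanity-check the import, a quick look at that table could
go something like this (a rough sketch only: it assumes the default MySQL
backend, the pymysql client, placeholder credentials, and the
wb_property_info layout as I remember it -- pi_property_id, pi_type --
so verify against your schema):

  import pymysql  # placeholder credentials below; adjust for your wiki

  conn = pymysql.connect(host="localhost", user="wikiuser",
                         password="secret", database="wikidatawiki")
  with conn.cursor() as cur:
      # wb_property_info has one row per property; pi_type is the datatype.
      cur.execute("SELECT pi_type FROM wb_property_info"
                  " WHERE pi_property_id = %s", (41,))  # 41 = P41
      row = cur.fetchone()
  print(row[0] if row else "P41 not imported yet")  # expected: commonsMedia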

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-28 Thread gnosygnu
> If you are also using the same software (Wikibase on MediaWiki), the XML dumps
> should Just Work (tm). The idea of the XML dumps is that the "text" blobs are
> opaque to 3rd parties, but will continue to work with future versions of
> MediaWiki & friends (with a compatible configuration - which is rather 
> tricky).

Not sure I follow. Even from a Wikibase on MediaWiki perspective, the
XML dumps are still incomplete (since they're missing
mainsnak.datatype).

For example, consider the following:
* You download only the pages-articles.xml dump from
https://dumps.wikimedia.org/wikidatawiki/latest/
* You load it into MediaWiki
* You then create a module that looks like the Wikidata module from
Russian Wikipedia:
https://ru.wikipedia.org/w/index.php?title=Module:Wikidata&action=edit

One line of the file specifically checks for datatype: "if datatype
and datatype == 'commonsMedia' then". This line always evaluates to
false, even though the entity (Q38: Italy) and property (P41: flag
image) do have the "commonsMedia" datatype (since the XML dump does not
include "mainsnak.datatype").

From a user standpoint, this means that if you're trying to set up a
local version of Russian Wikipedia and Wikidata, no Country infobox
will show the country's flag (the above line of code will substitute
text for the image).

The only way around this is to supplement the XML dump with the JSON
dump. But then, you'll need to download two large dumps and somehow
merge them. (I don't know if MediaWiki has a facility to load the JSON
dump, much less merge it.)

Anyway, I understand that there are technical complications with
trying to add mainsnak.datatype to the XML dumps. But if this never
gets resolved, then the current situation basically offers two
unsatisfying options:
* Have an XML dump which is 99.9% complete but still missing key info
(mainsnak.datatype)
* Try to merge the JSON dump into the XML dump (which MediaWiki may
not be able to do; a rough sketch of what I mean follows below)
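
Here's that sketch (Python; it assumes the JSON dump's
one-entity-per-line layout, which I believe is the case but haven't
verified against the spec):

  import bz2
  import json

  # Pass 1: harvest each property's datatype from the JSON dump.
  datatypes = {}
  with bz2.open("wikidata-20161114-all.json.bz2", "rt",
                encoding="utf-8") as f:
      for line in f:
          line = line.strip().rstrip(",")
          if not line.startswith("{"):
              continue  # skip the surrounding "[" and "]" lines
          entity = json.loads(line)
          if entity.get("id", "").startswith("P"):
              datatypes[entity["id"]] = entity.get("datatype")

  # Pass 2 (not shown): while walking the entity JSON inside the XML
  # dump, set snak["datatype"] = datatypes[snak["property"]] on every
  # mainsnak before handing the page to the importer.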

Hope this makes sense.

Thanks.

On Sun, Nov 27, 2016 at 11:49 AM, Daniel Kinzler wrote:
> On 27.11.2016 at 01:15, gnosygnu wrote:
>> This is useful, but unfortunately it won't suffice. Wikidata also has
>> pages which are wikitext (for example,
>> https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These
>> wikitext pages are in the XML dumps, but aren't in the stub dumps nor
>> the JSON dumps. I actually do use these Wikidata wikitext entries to
>> try to reproduce Wikidata in its entirety.
>
> If you are also using the same software (Wikibase on MediaWiki), the XML dumps
> should Just Work (tm). The idea of the XML dumps is that the "text" blobs are
> opaque to 3rd parties, but will continue to work with future versions of
> MediaWiki & friends (with a compatible configuration - which is rather 
> tricky).
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-27 Thread Daniel Kinzler
On 27.11.2016 at 01:15, gnosygnu wrote:
> This is useful, but unfortunately it won't suffice. Wikidata also has
> pages which are wikitext (for example,
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These
> wikitext pages are in the XML dumps, but aren't in the stub dumps nor
> the JSON dumps. I actually do use these Wikidata wikitext entries to
> try to reproduce Wikidata in its entirety. 

If you are also using the same software (Wikibase on MediaWiki), the XML dumps
should Just Work (tm). The idea of the XML dumps is that the "text" blobs are
opaque to 3rd parties, but will continue to work with future versions of
MediaWiki & friends (with a compatible configuration - which is rather tricky).


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-26 Thread gnosygnu
Hi Daniel,

Thanks for the quick and helpful reply. I was hoping that the XML
dumps could be changed, but I understand now that the JSON dumps are
the recommended format.

> To avoid downloading redundant information, you can use one of the
> wikidatawiki-20161120-stub-* dumps instead of the full page dumps

This is useful, but unfortunately it won't suffice. Wikidata also has
pages which are wikitext (for example,
https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These
wikitext pages are in the XML dumps, but aren't in the stub dumps nor
the JSON dumps. I actually do use these Wikidata wikitext entries to
try to reproduce Wikidata in its entirety. So for now, it looks like
both XML dumps and JSON dumps will be required.

At any rate, thanks again for the excellent reply.


On Sat, Nov 26, 2016 at 12:25 PM, Daniel Kinzler wrote:
> Hi gnosygnu!
>
> The JSON in the XML dumps is the raw contents of the storage backend. It can't
> be changed retroactively, and re-encoding everything on the fly would be too
> expensive. Also, the JSON embedded in the XML files is not officially 
> supported
> as a stable interface of Wikibase. The JSON format in the XML files can change
> without notice, and you may encounter different representations even within 
> the
> same dump.
>
> I recommend to use the JSON dumps, they contain our data in canonical form. To
> avoid downloading redundant information, you can use one of the
> wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't
> contain the actual page content, just meta-data.
>
> Caveat: there is currently no dump that contains the JSON of old revisions of
> entities in canonical form. You can only get them individually from
> Special:EntityData, e.g.
> 
>
> HTH
> -- daniel
>
> On 26.11.2016 at 02:13, gnosygnu wrote:
>> Hi everyone. I have a question about the Wikidata xml dump, but I'm
>> posting this question here, because it looks more related to Wikidata.
>>
>> In short, it seems that the "pages-articles.xml" does not include the
>> datatype property for snaks. For example, the xml dump does not list a
>> datatype for Q38 (Italy) and P41 (flag image). In contrast, the json
>> dump does list a datatype of "commonsMedia".
>>
>> Can this datatype property be included in future xml dumps? The
>> alternative would be to download two large and redundant dumps (xml
>> and json) in order to reconstruct a Wikidata instance.
>>
>> More information is provided below the break. Let me know if you need
>> anything else.
>>
>> Thanks.
>>
>> 
>>
>> Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag
>> image). Notice that there is no "datatype" property
>>   // 
>> https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-pages-articles.xml.bz2
>>   "mainsnak": {
>> "snaktype": "value",
>> "property": "P41",
>> "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
>> "datavalue": {
>>   "value": "Flag of Italy.svg",
>>   "type": "string"
>> }
>>   },
>>
>> Meanwhile, the API and the JSON dump list a datatype property of
>> "commonsMedia":
>>   // https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
>>   // 
>> https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114-all.json.bz2
>>   "P41": [{
>> "mainsnak": {
>>   "snaktype": "value",
>>   "property": "P41",
>>   "datavalue": {
>> "value": "Flag of Italy.svg",
>> "type": "string"
>>   },
>>   "datatype": "commonsMedia"
>> },
>>
>> As far as I can tell, the Turtle (ttl) dump does not list a datatype
>> property either, but this may be because I don't understand its
>> format.
>>   wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
>>   wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement,
>>   wikibase:BestRank ;
>> wikibase:rank wikibase:NormalRank ;
>> ps:P41 
>> 
>> ;
>> pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
>> pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
>>
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>



Re: [Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-26 Thread Daniel Kinzler
Hi gnosygnu!

The JSON in the XML dumps is the raw contents of the storage backend. It can't
be changed retroactively, and re-encoding everything on the fly would be too
expensive. Also, the JSON embedded in the XML files is not officially supported
as a stable interface of Wikibase. The JSON format in the XML files can change
without notice, and you may encounter different representations even within the
same dump.

I recommend to use the JSON dumps, they contain our data in canonical form. To
avoid downloading redundant information, you can use one of the
wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't
contain the actual page content, just meta-data.

Caveat: there is currently no dump that contains the JSON of old revisions of
entities in canonical form. You can only get them individually from
Special:EntityData, e.g.
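(a sketch from memory; the "revision" parameter and the revision number
below are placeholders to verify, not guaranteed):

  import json
  import urllib.request

  # Placeholder revision number; substitute a real revision ID.
  url = ("https://www.wikidata.org/wiki/Special:EntityData/Q38.json"
         "?revision=123456789")
  with urllib.request.urlopen(url) as resp:
      entity = json.load(resp)["entities"]["Q38"]
  print(entity["labels"]["en"]["value"])  # the label as of that revision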


HTH
-- daniel

On 26.11.2016 at 02:13, gnosygnu wrote:
> Hi everyone. I have a question about the Wikidata xml dump, but I'm
> posting this question here, because it looks more related to Wikidata.
> 
> In short, it seems that the "pages-articles.xml" does not include the
> datatype property for snaks. For example, the xml dump does not list a
> datatype for Q38 (Italy) and P41 (flag image). In contrast, the json
> dump does list a datatype of "commonsMedia".
> 
> Can this datatype property be included in future xml dumps? The
> alternative would be to download two large and redundant dumps (xml
> and json) in order to reconstruct a Wikidata instance.
> 
> More information is provided below the break. Let me know if you need
> anything else.
> 
> Thanks.
> 
> 
> 
> Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag
> image). Notice that there is no "datatype" property
>   // 
> https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-pages-articles.xml.bz2
>   "mainsnak": {
> "snaktype": "value",
> "property": "P41",
> "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
> "datavalue": {
>   "value": "Flag of Italy.svg",
>   "type": "string"
> }
>   },
> 
> Meanwhile, the API and the JSON dump list a datatype property of
> "commonsMedia":
>   // https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
>   // 
> https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114-all.json.bz2
>   "P41": [{
> "mainsnak": {
>   "snaktype": "value",
>   "property": "P41",
>   "datavalue": {
> "value": "Flag of Italy.svg",
> "type": "string"
>   },
>   "datatype": "commonsMedia"
> },
> 
> As far as I can tell, the Turtle (ttl) dump does not list a datatype
> property either, but this may be because I don't understand its
> format.
>   wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
>   wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement,
>   wikibase:BestRank ;
> wikibase:rank wikibase:NormalRank ;
> ps:P41 
> 
> ;
> pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
> pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
> 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



[Wikidata] Can mainsnak.datatype be included in the pages-articles.xml dump?

2016-11-25 Thread gnosygnu
Hi everyone. I have a question about the Wikidata xml dump, but I'm
posting this question here, because it looks more related to Wikidata.

In short, it seems that the "pages-articles.xml" does not include the
datatype property for snaks. For example, the xml dump does not list a
datatype for Q38 (Italy) and P41 (flag image). In contrast, the json
dump does list a datatype of "commonsMedia".

Can this datatype property be included in future xml dumps? The
alternative would be to download two large and redundant dumps (xml
and json) in order to reconstruct a Wikidata instance.

More information is provided below the break. Let me know if you need
anything else.

Thanks.



Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag
image). Notice that there is no "datatype" property
  // 
https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-pages-articles.xml.bz2
  "mainsnak": {
"snaktype": "value",
"property": "P41",
"hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
"datavalue": {
  "value": "Flag of Italy.svg",
  "type": "string"
}
  },

Meanwhile, the API and the JSON dump list a datatype property of
"commonsMedia":
  // https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
  // 
https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114-all.json.bz2
  "P41": [{
"mainsnak": {
  "snaktype": "value",
  "property": "P41",
  "datavalue": {
"value": "Flag of Italy.svg",
"type": "string"
  },
  "datatype": "commonsMedia"
},

As far as I can tell, the Turtle (ttl) dump does not list a datatype
property either, but this may be because I don't understand its
format.
  wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
  wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement,
  wikibase:BestRank ;
wikibase:rank wikibase:NormalRank ;
ps:P41 

;
pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
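
One more data point: if I read the RDF mapping right, the datatype is
attached to the property entity itself (a wikibase:propertyType triple
on wd:P41) rather than to each statement, which would explain why it
doesn't show up on the statement nodes above. Here's a quick check
against the public query service -- a sketch that assumes the endpoint's
built-in prefixes and that predicate name, both from memory:

  import json
  import urllib.parse
  import urllib.request

  # wd: and wikibase: are assumed to be built-in prefixes on the endpoint.
  query = "SELECT ?type WHERE { wd:P41 wikibase:propertyType ?type }"
  url = ("https://query.wikidata.org/sparql?format=json&query="
         + urllib.parse.quote(query))
  with urllib.request.urlopen(url) as resp:
      data = json.load(resp)
  for binding in data["results"]["bindings"]:
      print(binding["type"]["value"])  # e.g. ...ontology#CommonsMedia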
