Just an idea for a very sparse but hopefully not too bad encoding (I have not 
actually tested it).

NB: I am going to rely heavily on the terms defined in the glossary [1].

A value could be encoded as a vector:
- for entity ids: a vector V whose dimension is the number of existing 
entities, such that V[q] = 1 if, and only if, the value is the entity q, and 
V[q] = 0 otherwise.
- for time: a vector with year, month, day, hours, minutes, seconds, 
is_precision_year, is_precision_month, ..., is_gregorian, is_julian (or 
something similar)
- for geo coordinates: latitude, longitude, is_earth, is_moon...
- for strings/language strings: an encoding depending on your use case
...
Example: to encode "Q2" you would have the vector {0,1,0,...}.
To encode the year 2000 you would have {2000,0,..., is_precision_decade=0, 
is_precision_year=1, is_precision_month=0,..., is_gregorian=1,...}
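
To make the two examples above concrete, here is a minimal Python sketch. It 
is untested against real data and everything named in it is an assumption made 
for illustration: the toy universe of 5 entities, the exact list and order of 
time fields, and the function names encode_entity and encode_time.

```python
# Minimal sketch of the value encodings described above.
# NUM_ENTITIES, TIME_FIELDS and both function names are illustrative
# assumptions, not part of any Wikidata API.

NUM_ENTITIES = 5  # toy universe: Q1..Q5


def encode_entity(q, num_entities=NUM_ENTITIES):
    """One-hot vector for entity Qq: cell q-1 (0-based) is 1, the rest 0."""
    if not 1 <= q <= num_entities:
        raise ValueError("unknown entity id: Q%d" % q)
    vec = [0] * num_entities
    vec[q - 1] = 1
    return vec


# Assumed field layout; the real list of precisions and calendars is longer.
TIME_FIELDS = [
    "year", "month", "day", "hours", "minutes", "seconds",
    "is_precision_decade", "is_precision_year", "is_precision_month",
    "is_gregorian", "is_julian",
]


def encode_time(year, precision="year", calendar="gregorian",
                month=0, day=0, hours=0, minutes=0, seconds=0):
    """Numeric fields followed by 0/1 flags for precision and calendar."""
    values = {
        "year": year, "month": month, "day": day,
        "hours": hours, "minutes": minutes, "seconds": seconds,
        "is_precision_" + precision: 1,
        "is_" + calendar: 1,
    }
    unknown = set(values) - set(TIME_FIELDS)
    if unknown:
        raise ValueError("unsupported fields: %s" % sorted(unknown))
    return [values.get(field, 0) for field in TIME_FIELDS]


q2 = encode_entity(2)
year_2000 = encode_time(2000)
```

With the real number of entities a dense list like this would be huge, so in 
practice you would probably store only the index of the non-zero cell.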

To encode a snak you build a big vector by concatenating one value slot per 
property: the slot for P1, then the slot for P2, and so on (you use the 
property datatype to pick a suitable vector shape for each slot). You also add 
two cells per property to encode is_novalue and is_somevalue. To encode 
"P31: Q5" you would have a vector V = {0,....,0,0,0,0,1,0,....} with a 1 only 
at V[P31_offset + Q5_offset].
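
A rough sketch of this snak layout, assuming a toy setup where every property 
is item-valued (so each property slot is a one-hot entity vector plus the two 
is_novalue / is_somevalue cells). The property list, the 5-entity universe and 
the name encode_snak are all hypothetical.

```python
# Sketch of the snak vector: one slot per property, each slot holding the
# value vector plus two cells for is_novalue and is_somevalue.
# Assumption: all properties here are item-valued, so every slot has the
# same size. Names and sizes are illustrative only.

NUM_ENTITIES = 5                      # toy universe: Q1..Q5
PROPERTIES = ["P31", "P42"]           # toy property list
SLOT_SIZE = NUM_ENTITIES + 2          # value cells + is_novalue + is_somevalue
PROP_OFFSET = {p: i * SLOT_SIZE for i, p in enumerate(PROPERTIES)}
SNAK_DIM = SLOT_SIZE * len(PROPERTIES)


def encode_snak(prop, q=None, kind="value"):
    """Return the snak vector with a single 1 in the slot of `prop`."""
    vec = [0] * SNAK_DIM
    base = PROP_OFFSET[prop]
    if kind == "value":
        if q is None or not 1 <= q <= NUM_ENTITIES:
            raise ValueError("a value snak needs a known entity id")
        vec[base + (q - 1)] = 1           # V[P_offset + Q_offset]
    elif kind == "novalue":
        vec[base + NUM_ENTITIES] = 1      # is_novalue cell
    elif kind == "somevalue":
        vec[base + NUM_ENTITIES + 1] = 1  # is_somevalue cell
    else:
        raise ValueError("unknown snak kind: %r" % kind)
    return vec


p31_q5 = encode_snak("P31", 5)
p42_novalue = encode_snak("P42", kind="novalue")
```

Slots for time or geo coordinate properties would have a different size, so 
the offsets would then have to be computed from the per-datatype slot sizes 
instead of a single SLOT_SIZE.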

To encode a claim you could concatenate the main snak vector and a qualifiers 
vector, which is the merge of the snak vectors of all qualifiers (i.e. you 
build the vector for each qualifier snak and you sum them), so that the 
qualifiers vector encodes all qualifiers at the same time. This lets you check 
that a qualifier is set just by picking the right cell in the vector. But it 
will do bad things if there are two qualifiers with the same property and a 
datatype like time or geo coordinates, because their numeric fields get added 
together. But I don't think it is really a problem.
Example: to encode the claim with main snak "P31: Q5" and qualifiers "P42: Q42, 
P42: Q44" we would have a vector V such that V[P31_offset + Q5_offset] = 1, 
V[qualifiers_offset + P42_offset + Q42_offset] = 1 and V[qualifiers_offset + 
P42_offset + Q44_offset] = 1, and 0 elsewhere.
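
The claim example can be sketched like this, again only as an untested 
illustration. The three-entity universe, the offset tables and the names 
snak_index and encode_claim are assumptions; only item-valued properties are 
handled.

```python
# Sketch of the claim vector: main snak vector concatenated with the sum of
# all qualifier snak vectors. Toy universe and names are assumptions.

ENTITIES = ["Q5", "Q42", "Q44"]
PROPERTIES = ["P31", "P42"]

ENTITY_OFFSET = {q: i for i, q in enumerate(ENTITIES)}
SLOT_SIZE = len(ENTITIES) + 2          # + is_novalue, is_somevalue cells
PROP_OFFSET = {p: i * SLOT_SIZE for i, p in enumerate(PROPERTIES)}
SNAK_DIM = SLOT_SIZE * len(PROPERTIES)
QUALIFIERS_OFFSET = SNAK_DIM           # qualifiers part starts after main snak


def snak_index(prop, entity):
    """Cell index of "prop: entity" inside one snak vector."""
    return PROP_OFFSET[prop] + ENTITY_OFFSET[entity]


def encode_claim(main_snak, qualifiers):
    """main_snak is a (property, entity) pair, qualifiers a list of pairs."""
    vec = [0] * (2 * SNAK_DIM)
    vec[snak_index(*main_snak)] = 1
    for prop, entity in qualifiers:
        # Summing the qualifier snak vectors: each one adds 1 to its cell.
        vec[QUALIFIERS_OFFSET + snak_index(prop, entity)] += 1
    return vec


claim = encode_claim(("P31", "Q5"), [("P42", "Q42"), ("P42", "Q44")])
active_cells = [i for i, x in enumerate(claim) if x != 0]
has_p42_q44 = claim[QUALIFIERS_OFFSET + snak_index("P42", "Q44")] == 1
```

Checking whether a given qualifier is set is then a single cell lookup, as in 
the last line.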

I am not sure how to encode statement references (merging all of them and 
encoding the result just like the qualifiers vector is maybe a first step, but 
it is bad if we have multiple references). For the rank you just need 3 
booleans: is_preferred, is_normal and is_deprecated.

Cheers,

Thomas

[1] https://www.wikidata.org/wiki/Wikidata:Glossary


> On 27 Sept. 2017, at 12:41, John Erling Blad <[email protected]> wrote:
> 
> Is there anyone that has done any work on how to encode statements as 
> features for neural nets? I'm mostly interested in sparse encoders for online 
> training of live networks.
> 
> 
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata
