The most important thing for my problem would be to encode quantity and geopos. The test case is lake sizes, used to generate properly localized descriptions.
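As a rough illustration of the kind of sparse logarithmic encoding discussed below, a quantity like a lake area could be bucketed on a log scale into a one-hot vector. This is only a sketch; the helper name, bucket count, and bucket boundaries are all made-up assumptions, not anything from the thread:

```python
import math

def encode_quantity_log(value, n_buckets=32, base=10.0, min_exp=-2):
    """Sketch: encode a positive quantity (e.g. lake area in km^2) as a
    sparse one-hot vector over logarithmic buckets. All parameters here
    are illustrative assumptions."""
    vec = [0.0] * n_buckets
    if value > 0:
        # bucket index grows with the log of the value, clipped to range
        idx = int(math.floor(math.log(value, base) - min_exp))
        idx = max(0, min(n_buckets - 1, idx))
        vec[idx] = 1.0
    return vec

# Two lakes differing by several orders of magnitude land in
# different buckets, which is the point of a log encoding.
small = encode_quantity_log(3.2)       # a small lake, 3.2 km^2
large = encode_quantity_log(371000.0)  # Caspian Sea, ~371,000 km^2
```

In practice one would tune the bucket granularity to the value distribution, and the "log of pairwise differences" mentioned below could be bucketed the same way.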
Unless someone already has a working solution, I would encode this as sparse logarithmic vectors, probably also with the log of pairwise differences. Encoding of qualifiers is interesting, but would require encoding of a topic map, and that adds an additional layer of complexity. How to encode the values is not so much the problem; it is avoiding reimplementing this yet another time… ;)

On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon <[email protected]> wrote:
> Just an idea of a very sparse but hopefully not so bad encoding (I have
> not actually tested it).
>
> NB: I am going to use a lot of the terms defined in the glossary [1].
>
> A value could be encoded by a vector:
> - for entity ids: a vector V that has the dimension of the number of
>   existing entities, such that V[q] = 1 if, and only if, it is the entity q,
>   and V[q] = 0 if not.
> - for time: a vector with year, month, day, hours, minutes, seconds,
>   is_precision_year, is_precision_month, ..., is_gregorian, is_julian (or
>   something similar)
> - for geo coordinates: latitude, longitude, is_earth, is_moon...
> - for strings/language strings: an encoding depending on your use case
> ...
> Example: to encode "Q2" you would have the vector {0,1,0....}
> To encode the year 2000 you would have {2000, 0..., is_precision_decade=0,
> is_precision_year=1, is_precision_month=0, ..., is_gregorian=true, ...}
>
> To encode a snak you build a big vector by concatenating the vector of the
> value if it is P1, if it is P2... (you use the property datatype to pick a
> good vector shape), and you add two cells per property to encode is_novalue
> and is_somevalue. To encode "P31: Q5" you would have a vector V =
> {0,....,0,0,0,0,1,0,....} with 1 only for V[P31_offset + Q5_offset].
>
> To encode a claim you could concatenate the main snak vector with a
> qualifiers vector that is the merge of the snak vectors of all qualifiers
> (i.e. you build the vector for each snak and you sum them), so that the
> qualifiers vector encodes all qualifiers at the same time. It allows you to
> check that a qualifier is set just by picking the right cell in the
> vector. But it will do bad things if there are two qualifiers with the same
> property and a datatype like time or geocoordinates. I don't think that is
> really a problem, though.
> Example: to encode the claim with main snak "P31: Q5" and qualifiers
> "P42: Q42, P42: Q44" we would have a vector V such that
> V[P31_offset + Q5_offset] = 1, V[qualifiers_offset + P42_offset + Q42_offset] = 1,
> V[qualifiers_offset + P42_offset + Q44_offset] = 1, and 0 elsewhere.
>
> I am not sure how to encode statement references (merging all of them and
> encoding them just like the qualifiers vector is maybe a first step, but it
> is bad if we have multiple references). For the rank you just need 3
> booleans: is_preferred, is_normal and is_deprecated.
>
> Cheers,
>
> Thomas
>
> [1] https://www.wikidata.org/wiki/Wikidata:Glossary
>
>
> > On 27 Sept 2017 at 12:41, John Erling Blad <[email protected]> wrote:
> >
> > Is there anyone that has done any work on how to encode statements as
> > features for neural nets? I'm mostly interested in sparse encoders for
> > online training of live networks.
> >
> >
> > _______________________________________________
> > Wikidata mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata
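The snak/claim layout Thomas sketches above could be written down roughly as follows. This is a toy sketch, not a tested design: the property list, entity count, and offset arithmetic are illustrative assumptions, and only entity-id (item) values are handled:

```python
# Toy sketch of the snak/claim encoding quoted above.
# Assumptions: 3 properties with item datatype, 100 entities,
# plus 2 cells per property for is_novalue / is_somevalue.
N_ENTITIES = 100
PROPS = ["P31", "P42", "P569"]
PROP_WIDTH = N_ENTITIES + 2          # value one-hot + novalue/somevalue
SNAK_LEN = len(PROPS) * PROP_WIDTH   # total width of one snak vector

def encode_snak(prop, entity=None, novalue=False, somevalue=False):
    """One snak: a 1 at prop_offset + entity_offset, or in the
    novalue/somevalue cells of that property's block."""
    vec = [0.0] * SNAK_LEN
    base = PROPS.index(prop) * PROP_WIDTH
    if entity is not None:
        qid = int(entity[1:])        # "Q5" -> 5 (toy entity offset)
        vec[base + qid] = 1.0
    if novalue:
        vec[base + N_ENTITIES] = 1.0
    if somevalue:
        vec[base + N_ENTITIES + 1] = 1.0
    return vec

def encode_claim(main, qualifiers=()):
    """Main snak vector concatenated with the sum of all qualifier
    snak vectors (the 'merge' described in the quoted message)."""
    qual = [0.0] * SNAK_LEN
    for prop, entity in qualifiers:
        qual = [a + b for a, b in zip(qual, encode_snak(prop, entity))]
    return encode_snak(*main) + qual

# The worked example from the message: main snak "P31: Q5" with
# qualifiers "P42: Q42" and "P42: Q44".
v = encode_claim(("P31", "Q5"), [("P42", "Q42"), ("P42", "Q44")])
```

Checking whether a qualifier is set is then a single cell lookup, exactly as described; the collision caveat for repeated same-property qualifiers with time or geo datatypes applies here too.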
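The time-value vector described in the quoted message ("year, month, day, ... precision flags ... calendar flags") might look like the sketch below. The field order, the set of precisions, and the flag layout are assumptions read off the sketch above, not a specified format:

```python
# Toy sketch of the time-value vector quoted above: numeric fields
# followed by one-hot precision flags and one-hot calendar flags.
PRECISIONS = ["decade", "year", "month", "day"]   # assumed subset
CALENDARS = ["gregorian", "julian"]

def encode_time(year, month=0, day=0, precision="year",
                calendar="gregorian"):
    numeric = [float(year), float(month), float(day)]
    prec = [1.0 if p == precision else 0.0 for p in PRECISIONS]
    cal = [1.0 if c == calendar else 0.0 for c in CALENDARS]
    return numeric + prec + cal

# The example from the message: the year 2000 at year precision
# in the Gregorian calendar.
v = encode_time(2000)
```

This matches the quoted example vector {2000, 0, ..., is_precision_year=1, ..., is_gregorian=true, ...} in spirit, with the obvious caveat that a real encoding would cover all of Wikidata's precisions and time fields.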
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
