The most important thing for my problem would be to encode quantities and
geographic positions. The test case is lake sizes, which are needed to
produce proper localized descriptions.

Unless someone already has a working solution, I would encode these as
sparse logarithmic vectors, probably also with the log of pairwise differences.
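
A minimal sketch of what such a sparse logarithmic encoding might look like,
assuming base-2 bins and positive values. The function names, the bin range
(min_exp, max_exp) and the sample values are all made up for illustration;
nothing here is an existing Wikidata or library API.

```python
import math
from itertools import combinations

def log_bin_index(value, min_exp=-10, max_exp=40):
    """Index of the base-2 logarithmic bin that holds a positive value."""
    if value <= 0:
        raise ValueError("log binning needs a positive value")
    exp = math.floor(math.log2(value))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the covered range
    return exp - min_exp

def encode_quantity(value):
    """Sparse vector as {index: 1.0}: exactly one active bin."""
    return {log_bin_index(value): 1.0}

def encode_pairwise_differences(values):
    """Log bin of |a - b| for every pair; equal pairs are skipped."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        diff = abs(a - b)
        if diff > 0:
            out[(i, j)] = log_bin_index(diff)
    return out

# Hypothetical lake areas in km^2 (not real data).
areas = [8.0, 2.0, 2.0]
encoded = [encode_quantity(a) for a in areas]
pairs = encode_pairwise_differences(areas)
```

With the assumed range there are 51 bins in total, so each quantity costs one
active cell regardless of its magnitude.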

Encoding of qualifiers is interesting, but would require encoding of a
topic map, and that adds an additional layer of complexity.

How to encode the values is not so much the problem; the point is to avoid
reimplementing this yet another time… ;)

On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon <
[email protected]> wrote:

> Just an idea of a very sparse but hopefully not so bad encoding (I have
> not actually tested it).
>
> NB: I am going to make heavy use of the terms defined in the glossary [1].
>
> A value could be encoded by a vector:
> - for entity ids: a vector V whose dimension is the number of existing
> entities, such that V[q] = 1 if, and only if, it is the entity q
> and V[q] = 0 if not.
> - for time: a vector with year, month, day, hours, minutes, seconds,
> is_precision_year, is_precision_month, ..., is_gregorian, is_julian (or
> something similar)
> - for geo coordinates: latitude, longitude, is_earth, is_moon...
> - string/language strings: an encoding depending on your use case
> ...
> Example: to encode "Q2" you would have the vector {0,1,0,...}
> To encode the year 2000 you would have {2000,0,..., is_precision_decade =
> 0,is_precision_year=1,is_precision_month=0,...,is_gregorian=1,...}
>
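
The two examples above can be sketched as follows. This is only an
illustration under assumptions: entity_index is a hypothetical mapping from
entity id to position, the field list is one possible reading of "or
something similar", and the one-hot vector is stored sparsely as
{index: 1.0} rather than as a full array.

```python
# Assumed field order for the time vector; the original list is open-ended.
TIME_FIELDS = [
    "year", "month", "day", "hour", "minute", "second",
    "is_precision_year", "is_precision_month", "is_precision_day",
    "is_gregorian", "is_julian",
]

def encode_entity(entity_id, entity_index):
    """One-hot vector stored sparsely: only the active cell is kept."""
    return {entity_index[entity_id]: 1.0}

def encode_time(year, month=0, day=0, precision="year", calendar="gregorian"):
    """Dense time vector following TIME_FIELDS."""
    vec = dict.fromkeys(TIME_FIELDS, 0.0)
    vec["year"], vec["month"], vec["day"] = float(year), float(month), float(day)
    for flag in ("is_precision_" + precision, "is_" + calendar):
        if flag not in vec:
            raise ValueError("unsupported flag: " + flag)
        vec[flag] = 1.0
    return [vec[name] for name in TIME_FIELDS]

# Hypothetical three-entity universe, so that Q2 lands in the second cell.
entity_index = {"Q1": 0, "Q2": 1, "Q3": 2}
q2 = encode_entity("Q2", entity_index)
year_2000 = encode_time(2000)
```
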
> To encode a snak you build a big vector by concatenating one value vector
> per property: one block for P1, one for P2, and so on (you use the
> property datatype to pick a good shape for each block). You also add two
> cells per property to encode is_novalue and is_somevalue. To encode
> "P31: Q5" you would have a vector V = {0,....,0,0,0,0,1,0,....} with 1
> only for V[P31_offset + Q5_offset]
>
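
A rough sketch of the offset arithmetic for snaks, under the simplifying
assumption that every property is entity-valued, so each property block has
one cell per entity plus the two is_novalue / is_somevalue cells. The helper
names and the tiny property and entity lists are hypothetical.

```python
def build_property_offsets(properties, n_entities):
    """Give each property a block of n_entities + 2 cells."""
    block = n_entities + 2  # entity cells + is_novalue + is_somevalue
    return {prop: i * block for i, prop in enumerate(properties)}

def encode_snak(prop, value, prop_offsets, entity_index):
    """Sparse snak vector as {index: 1.0}."""
    base = prop_offsets[prop]
    n = len(entity_index)
    if value == "novalue":
        return {base + n: 1.0}
    if value == "somevalue":
        return {base + n + 1: 1.0}
    return {base + entity_index[value]: 1.0}  # V[P_offset + Q_offset] = 1

entity_index = {"Q5": 0, "Q42": 1, "Q44": 2}
prop_offsets = build_property_offsets(["P31", "P42"], len(entity_index))
p31_q5 = encode_snak("P31", "Q5", prop_offsets, entity_index)
p42_unknown = encode_snak("P42", "somevalue", prop_offsets, entity_index)
```

Properties with other datatypes (time, coordinates) would need blocks of a
different size, which this sketch leaves out.
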
> To encode a claim you could concatenate the main snak vector and the
> qualifiers vector, which is the merge of the snak vectors of all
> qualifiers (i.e. you build the vector for each snak and sum them), so
> that the qualifiers vector encodes all qualifiers at the same time. This
> lets you check that a qualifier is set just by picking the right cell in
> the vector. But it will do bad things if there are two qualifiers with the
> same property and a datatype like time or geocoordinates, because their
> numeric fields get added together. But I don't think it is really a
> problem.
> Example: to encode the claim with "P31: Q5" main snak and qualifiers "P42:
> Q42, P42: Q44" we would have a vector V such that V[P31_offset + Q5_offset]
> = 1, V[qualifiers_offset + P42_offset + Q42_offset] = 1 and
> V[qualifiers_offset + P42_offset + Q44_offset] = 1 and 0 elsewhere.
>
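
The claim example could be sketched like this, again assuming entity-valued
properties only and leaving out the is_novalue / is_somevalue cells to keep
it short. qualifiers_offset is taken to be the size of the main snak vector;
all names are illustrative.

```python
from collections import Counter

def snak_index(prop, entity, prop_offsets, entity_index):
    """Cell of a snak inside one snak vector: P_offset + Q_offset."""
    return prop_offsets[prop] + entity_index[entity]

def encode_claim(main_snak, qualifiers, prop_offsets, entity_index):
    """Main snak vector followed by the sum of all qualifier snak vectors."""
    qualifiers_offset = len(prop_offsets) * len(entity_index)
    vec = Counter()
    vec[snak_index(*main_snak, prop_offsets, entity_index)] += 1.0
    for qualifier in qualifiers:
        cell = snak_index(*qualifier, prop_offsets, entity_index)
        vec[qualifiers_offset + cell] += 1.0  # summing merges the qualifiers
    return dict(vec)

entity_index = {"Q5": 0, "Q42": 1, "Q44": 2}
prop_offsets = {"P31": 0, "P42": 3}  # one block of 3 cells per property
claim = encode_claim(
    ("P31", "Q5"),
    [("P42", "Q42"), ("P42", "Q44")],
    prop_offsets,
    entity_index,
)
```

Checking whether a given qualifier is set is then a single lookup, e.g.
`(6 + 3 + 1) in claim` for "P42: Q42" in this toy layout.
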
> I am not sure how to encode statement references (merging all of them and
> encoding the result just like the qualifiers vector is maybe a first step,
> but it is bad if we have multiple references). For the rank you just need
> 3 booleans: is_preferred, is_normal and is_deprecated.
>
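
For completeness, the three rank booleans amount to a one-hot vector. A tiny
sketch, with the cell order chosen arbitrarily:

```python
RANKS = ["preferred", "normal", "deprecated"]  # assumed cell order

def encode_rank(rank):
    """Return [is_preferred, is_normal, is_deprecated] as floats."""
    if rank not in RANKS:
        raise ValueError("unknown rank: " + rank)
    return [1.0 if rank == name else 0.0 for name in RANKS]

normal = encode_rank("normal")
```
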
> Cheers,
>
> Thomas
>
> [1] https://www.wikidata.org/wiki/Wikidata:Glossary
>
>
> > On 27 Sept 2017, at 12:41, John Erling Blad <[email protected]> wrote:
> >
> > Is there anyone that has done any work on how to encode statements as
> > features for neural nets? I'm mostly interested in sparse encoders for
> > online training of live networks.
> >
> >
> > _______________________________________________
> > Wikidata mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata