You might view my problem as learning an embedding for words (and their fragments) driven by valued statements (those you discard), and then inverting this learned encoder into a language model. When describing an object it would then be possible to choose better words (lexical choice in natural language generation).
On Mon, Oct 2, 2017 at 5:00 PM, <[email protected]> wrote:

> I have done some work on converting Wikidata items and properties to a
> low-dimensional representation (graph embedding).
>
> A webservice with a "most-similar" functionality based on computation in
> the low-dimensional space is running from
> https://tools.wmflabs.org/wembedder/most-similar/
>
> A query may look like:
>
> https://tools.wmflabs.org/wembedder/most-similar/Q20#language=en
>
> It is based on a simple Gensim model, https://github.com/fnielsen/wembedder,
> and could probably be improved.
>
> It is described in
> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/7011/pdf/imm7011.pdf
>
> It is not embedding statements but rather individual items.
>
> There is general research on graph embedding. I have added some of the
> scientific articles to Wikidata. You can see them with Scholia:
>
> https://tools.wmflabs.org/scholia/topic/Q32081746
>
> best regards
> Finn Årup Nielsen
> http://people.compute.dtu.dk/faan/
>
> On 09/27/2017 02:14 PM, John Erling Blad wrote:
>
>> The most important thing for my problem would be to encode quantity and
>> geopos. The test case is lake sizes, to encode proper localized
>> descriptions.
>>
>> Unless someone already has a working solution, I would encode this as
>> sparse logarithmic vectors, probably also with the log of pairwise
>> differences.
>>
>> Encoding of qualifiers is interesting, but would require encoding of a
>> topic map, and that adds an additional layer of complexity.
>>
>> How to encode the values is not so much the problem; it is avoiding
>> reimplementing this yet another time… ;)
>>
>> On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon
>> <[email protected]> wrote:
>>
>> Just an idea of a very sparse but hopefully not so bad encoding (I
>> have not actually tested it).
>>
>> NB: I am going to use a lot of the terms defined in the glossary [1].
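The sparse logarithmic encoding of quantities that John suggests above could be sketched roughly as follows. This is only one possible reading of the idea: the bucket range, the base-10 log, and the handling of the pairwise difference are my own assumptions, not anything specified in the thread.

```python
import math

def encode_quantity(value, lo=-3, hi=12):
    """One-hot bucket for floor(log10(value)).

    The bucket range [lo, hi] is an assumption; values outside it are
    clamped to the nearest bucket.
    """
    vec = [0.0] * (hi - lo + 1)
    bucket = min(max(int(math.floor(math.log10(value))), lo), hi)
    vec[bucket - lo] = 1.0
    return vec

def encode_pair(a, b, lo=-3, hi=12):
    """Concatenate two one-hot quantity vectors with the log of their
    absolute pairwise difference ("log of pairwise differences")."""
    diff = math.log10(abs(a - b)) if a != b else 0.0
    return encode_quantity(a, lo, hi) + encode_quantity(b, lo, hi) + [diff]

# A lake area of 3.7 km^2 = 3 700 000 m^2 falls in the 10^6 bucket:
v = encode_quantity(3_700_000)
```

The point of the log scale is that lake areas span many orders of magnitude, so a linear bucketing would waste most cells on the largest lakes.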
>>
>> A value could be encoded by a vector:
>> - for entity ids: a vector V with the dimension of the number of
>>   existing entities, such that V[q] = 1 if, and only if, it is the
>>   entity q, and V[q] = 0 if not.
>> - for time: a vector with year, month, day, hours, minutes, seconds,
>>   is_precision_year, is_precision_month, ..., is_gregorian, is_julian
>>   (or something similar)
>> - for geo coordinates: latitude, longitude, is_earth, is_moon, ...
>> - for strings/language strings: an encoding depending on your use case
>> ...
>>
>> Example: to encode "Q2" you would have the vector {0,1,0,...}.
>> To encode the year 2000 you would have {2000,0,...,
>> is_precision_decade=0, is_precision_year=1, is_precision_month=0, ...,
>> is_gregorian=true, ...}.
>>
>> To encode a snak you build a big vector by concatenating the vector of
>> the value if it is P1, if it is P2, ... (you use the property datatype
>> to pick a good vector shape), plus two cells per property to encode
>> is_novalue and is_somevalue. To encode "P31: Q5" you would have a
>> vector V = {0,...,0,0,0,0,1,0,...} with 1 only at
>> V[P31_offset + Q5_offset].
>>
>> To encode a claim you could concatenate the main snak vector with a
>> qualifiers vector that is the merge of the snak vectors of all
>> qualifiers (i.e. you build the vector for each snak and sum them), so
>> that the qualifiers vector encodes all qualifiers at the same time.
>> This allows you to check that a qualifier is set just by picking the
>> right cell in the vector. It will do bad things if there are two
>> qualifiers with the same property and a datatype like time or geo
>> coordinates, but I don't think that is really a problem.
>> Example: to encode the claim with main snak "P31: Q5" and qualifiers
>> "P42: Q42, P42: Q44", we would have a vector V such that
>> V[P31_offset + Q5_offset] = 1,
>> V[qualifiers_offset + P42_offset + Q42_offset] = 1 and
>> V[qualifiers_offset + P42_offset + Q44_offset] = 1, and 0 elsewhere.
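The snak and claim encoding Thomas describes above can be sketched like this. It is a toy illustration only: the entity-vocabulary size, the two-property list, and the use of a sparse `{index: value}` dict in place of a huge dense vector are all my assumptions.

```python
# Sketch of the sparse snak/claim encoding from the email above.
N_ENTITIES = 1000          # assumption: toy entity-vocabulary size
PROPS = ["P31", "P42"]     # assumption: only two entity-valued properties
BLOCK = N_ENTITIES + 2     # + two cells for is_novalue / is_somevalue

def prop_offset(prop):
    return PROPS.index(prop) * BLOCK

def entity_offset(qid):
    return int(qid[1:])    # "Q5" -> cell 5 inside the property block

def encode_snak(prop, qid):
    """One-hot cell at prop_offset + entity_offset, as in the email."""
    return {prop_offset(prop) + entity_offset(qid): 1.0}

QUALIFIERS_OFFSET = len(PROPS) * BLOCK  # qualifier part follows the main snak

def encode_claim(main, qualifiers):
    """Main-snak vector concatenated with the *sum* of the qualifier snak
    vectors; two qualifiers sharing a property just set two cells."""
    vec = dict(encode_snak(*main))
    for prop, qid in qualifiers:
        for idx, val in encode_snak(prop, qid).items():
            key = QUALIFIERS_OFFSET + idx
            vec[key] = vec.get(key, 0.0) + val
    return vec

# Thomas's example: main snak "P31: Q5", qualifiers "P42: Q42, P42: Q44"
claim = encode_claim(("P31", "Q5"), [("P42", "Q42"), ("P42", "Q44")])
```

The sparse-dict representation also shows the caveat Thomas raises: summing qualifier snaks is harmless for one-hot cells, but would add up real-valued cells (time, geo coordinates) for repeated properties.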
>>
>> I am not sure how to encode statement references (merging all of them
>> and encoding them just like the qualifiers vector is maybe a first
>> step, but it is bad if we have multiple references). For the rank you
>> just need three booleans: is_preferred, is_normal and is_deprecated.
>>
>> Cheers,
>>
>> Thomas
>>
>> [1] https://www.wikidata.org/wiki/Wikidata:Glossary
>>
>> > Le 27 sept. 2017 à 12:41, John Erling Blad <[email protected]> a écrit :
>> >
>> > Is there anyone that has done any work on how to encode statements
>> > as features for neural nets? I'm mostly interested in sparse
>> > encoders for online training of live networks.
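The three rank booleans Thomas mentions amount to a one-hot over the three Wikidata ranks; a minimal sketch (the function name is my own):

```python
RANKS = ["preferred", "normal", "deprecated"]

def encode_rank(rank):
    """Three cells: is_preferred, is_normal, is_deprecated (one-hot)."""
    return [1.0 if rank == r else 0.0 for r in RANKS]
```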
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
