You might view my problem as learning an embedding for words (and their fragments) driven by valued statements (those you discard), and then inverting this learned encoder into a language model. When describing an object it would then be possible to choose better words (lexical choice in natural language generation).
On Mon, Oct 2, 2017 at 5:00 PM, <[email protected]> wrote:

> I have done some work on converting Wikidata items and properties to a
> low-dimensional representation (graph embedding).
>
> A webservice with a "most-similar" functionality based on computation in
> the low-dimensional space is running from
> https://tools.wmflabs.org/wembedder/most-similar/
>
> A query may look like:
>
> https://tools.wmflabs.org/wembedder/most-similar/Q20#language=en
>
> It is based on a simple Gensim model, https://github.com/fnielsen/wembedder,
> and could probably be improved.
>
> It is described in
> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/7011/pdf/imm7011.pdf
>
> It is not embedding statements but rather individual items.
>
> There is general research on graph embedding. I have added some of the
> scientific articles to Wikidata. You can see them with Scholia:
>
> https://tools.wmflabs.org/scholia/topic/Q32081746
>
> best regards
> Finn Årup Nielsen
> http://people.compute.dtu.dk/faan/
>
> On 09/27/2017 02:14 PM, John Erling Blad wrote:
>
>> The most important thing for my problem would be to encode quantity and
>> geopos. The test case is lake sizes, to encode proper localized
>> descriptions.
>>
>> Unless someone already has a working solution, I would encode this as
>> sparse logarithmic vectors, probably also with the log of pairwise
>> differences.
>>
>> Encoding of qualifiers is interesting, but would require encoding of a
>> topic map, and that adds an additional layer of complexity.
>>
>> How to encode the values is not so much the problem; it is avoiding
>> reimplementing this yet another time… ;)
>>
>> On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon
>> <[email protected]> wrote:
>>
>> Just an idea of a very sparse but hopefully not so bad encoding (I
>> have not actually tested it).
>>
>> NB: I am going to use a lot of the terms defined in the glossary [1].
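The sparse logarithmic encoding of quantities that John suggests above could be sketched roughly as follows. This is only one possible reading of the idea: the bucket range, the base-10 log, and the handling of the pairwise difference are my own assumptions, not anything specified in the thread.

```python
import math

def encode_quantity(value, lo=-3, hi=12):
    """One-hot bucket for floor(log10(value)).

    The bucket range [lo, hi] is an assumption; values outside it are
    clamped to the nearest bucket.
    """
    vec = [0.0] * (hi - lo + 1)
    bucket = min(max(int(math.floor(math.log10(value))), lo), hi)
    vec[bucket - lo] = 1.0
    return vec

def encode_pair(a, b, lo=-3, hi=12):
    """Concatenate two one-hot quantity vectors with the log of their
    absolute pairwise difference ("log of pairwise differences")."""
    diff = math.log10(abs(a - b)) if a != b else 0.0
    return encode_quantity(a, lo, hi) + encode_quantity(b, lo, hi) + [diff]

# A lake area of 3.7 km^2 = 3 700 000 m^2 falls in the 10^6 bucket:
v = encode_quantity(3_700_000)
```

The point of the log scale is that lake areas span many orders of magnitude, so a linear bucketing would waste most cells on the largest lakes.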
>>
>> A value could be encoded by a vector:
>> - for entity ids: a vector V with the dimension of the number of
>>   existing entities, such that V[q] = 1 if, and only if, it is the
>>   entity q, and V[q] = 0 if not.
>> - for time: a vector with year, month, day, hours, minutes, seconds,
>>   is_precision_year, is_precision_month, ..., is_gregorian, is_julian
>>   (or something similar)
>> - for geo coordinates: latitude, longitude, is_earth, is_moon, ...
>> - for strings/language strings: an encoding depending on your use case
>> ...
>>
>> Example: to encode "Q2" you would have the vector {0,1,0,...}.
>> To encode the year 2000 you would have {2000,0,...,
>> is_precision_decade=0, is_precision_year=1, is_precision_month=0, ...,
>> is_gregorian=true, ...}.
>>
>> To encode a snak you build a big vector by concatenating the vector of
>> the value if it is P1, if it is P2, ... (you use the property datatype
>> to pick a good vector shape), plus two cells per property to encode
>> is_novalue and is_somevalue. To encode "P31: Q5" you would have a
>> vector V = {0,...,0,0,0,0,1,0,...} with 1 only at
>> V[P31_offset + Q5_offset].
>>
>> To encode a claim you could concatenate the main snak vector with a
>> qualifiers vector that is the merge of the snak vectors of all
>> qualifiers (i.e. you build the vector for each snak and sum them), so
>> that the qualifiers vector encodes all qualifiers at the same time.
>> This allows you to check that a qualifier is set just by picking the
>> right cell in the vector. It will do bad things if there are two
>> qualifiers with the same property and a datatype like time or geo
>> coordinates, but I don't think that is really a problem.
>> Example: to encode the claim with main snak "P31: Q5" and qualifiers
>> "P42: Q42, P42: Q44", we would have a vector V such that
>> V[P31_offset + Q5_offset] = 1,
>> V[qualifiers_offset + P42_offset + Q42_offset] = 1 and
>> V[qualifiers_offset + P42_offset + Q44_offset] = 1, and 0 elsewhere.
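The snak and claim encoding Thomas describes above can be sketched like this. It is a toy illustration only: the entity-vocabulary size, the two-property list, and the use of a sparse `{index: value}` dict in place of a huge dense vector are all my assumptions.

```python
# Sketch of the sparse snak/claim encoding from the email above.
N_ENTITIES = 1000          # assumption: toy entity-vocabulary size
PROPS = ["P31", "P42"]     # assumption: only two entity-valued properties
BLOCK = N_ENTITIES + 2     # + two cells for is_novalue / is_somevalue

def prop_offset(prop):
    return PROPS.index(prop) * BLOCK

def entity_offset(qid):
    return int(qid[1:])    # "Q5" -> cell 5 inside the property block

def encode_snak(prop, qid):
    """One-hot cell at prop_offset + entity_offset, as in the email."""
    return {prop_offset(prop) + entity_offset(qid): 1.0}

QUALIFIERS_OFFSET = len(PROPS) * BLOCK  # qualifier part follows the main snak

def encode_claim(main, qualifiers):
    """Main-snak vector concatenated with the *sum* of the qualifier snak
    vectors; two qualifiers sharing a property just set two cells."""
    vec = dict(encode_snak(*main))
    for prop, qid in qualifiers:
        for idx, val in encode_snak(prop, qid).items():
            key = QUALIFIERS_OFFSET + idx
            vec[key] = vec.get(key, 0.0) + val
    return vec

# Thomas's example: main snak "P31: Q5", qualifiers "P42: Q42, P42: Q44"
claim = encode_claim(("P31", "Q5"), [("P42", "Q42"), ("P42", "Q44")])
```

The sparse-dict representation also shows the caveat Thomas raises: summing qualifier snaks is harmless for one-hot cells, but would add up real-valued cells (time, geo coordinates) for repeated properties.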
>>
>> I am not sure how to encode statement references (merging all of them
>> and encoding them just like the qualifiers vector is maybe a first
>> step, but it is bad if we have multiple references). For the rank you
>> just need three booleans: is_preferred, is_normal and is_deprecated.
>>
>> Cheers,
>>
>> Thomas
>>
>> [1] https://www.wikidata.org/wiki/Wikidata:Glossary
>>
>> > Le 27 sept. 2017 à 12:41, John Erling Blad <[email protected]> a écrit :
>> >
>> > Is there anyone that has done any work on how to encode statements
>> > as features for neural nets? I'm mostly interested in sparse
>> > encoders for online training of live networks.
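The three rank booleans Thomas mentions amount to a one-hot over the three Wikidata ranks; a minimal sketch (the function name is my own):

```python
RANKS = ["preferred", "normal", "deprecated"]

def encode_rank(rank):
    """Three cells: is_preferred, is_normal, is_deprecated (one-hot)."""
    return [1.0 if rank == r else 0.0 for r in RANKS]
```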
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
