Re: [Wikimedia-l] If we had proper language models…

2019-05-19 Thread John Erling Blad
Perhaps I'll explain this a bit better…

Words can be converted into a vector representation by a word2vec
algorithm [1]. After conversion, each word is a point in a
high-dimensional space, and relations between words are vectors
between such points. Similar relations (or related relations) can be
found by operations on such vectors, or on sets of vectors. Often this
is illustrated as "queen is to king as woman is to man", and similar
relations.
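
As a rough illustration of the vector arithmetic described above, here
is a minimal sketch in Python. The three-dimensional vectors and the
helper names (VECTORS, cosine, analogy) are made up for the example;
real word2vec embeddings are learned from a corpus and typically have
hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, chosen by hand for illustration only.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors=VECTORS):
    """Return the word d such that 'a is to b as c is to d'.

    The relation a -> b is the difference vector (b - a); adding it to c
    gives a target point, and the nearest remaining word is the answer.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))
```

With these toy vectors the nearest remaining word to king - man + woman
is "queen".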

Some relations are quite obvious and common, but some relations simply
do not exist. If we can build a probability model over relations (a
regression model), then we can estimate the probability of observing a
specific relation, and thus be able to say "this does not seem to be a
probable word". (Typically one of several sequence models, such as a
recurrent neural network [2], would be used for the estimation, and
triplet loss [3] for the training phase.)
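
To make the training objective a bit more concrete, here is a minimal
sketch of the standard triplet loss in Python. How the triplets would
be chosen is my assumption, not something worked out here: the anchor
could be a relation vector, the positive an observed relation of the
same kind, and the negative a relation that was not observed. The
function names and the margin value are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least the margin, and grows otherwise.
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)

# Positive close to the anchor, negative far away: nothing to correct.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# Roles swapped: the supposed positive is far away, so the loss is large.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # 5.0
```

Training would push the loss toward zero, so that plausible relations
end up close together and implausible ones end up far apart.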

It would be like having a "spell right"-metric for text fragments.

Note that this isn't quite as easy as described, because words can have
multiple interpretations, which makes it difficult to build a stable
vector representation. An example is "car", which is typically
something you drive on a road, but it can also be part of a train, or
a toy.

[1] https://en.wikipedia.org/wiki/Word2vec
[2] https://en.wikipedia.org/wiki/Recurrent_neural_network
[3] https://en.wikipedia.org/wiki/Triplet_loss

On Sun, May 19, 2019 at 2:55 PM John Erling Blad wrote:
>
> Microsoft has unveiled an idea about a grammar and style tool for
> Word. [1] I proposed something similar for detecting problematic
> grammatical constructs in the content translation tools. [2] That was
> a couple of years ago now, and I closed the task.
>
> [1] https://venturebeat.com/2019/05/06/microsoft-debuts-ideas-in-word-a-grammar-and-style-suggestions-tool-powered-by-ai/
> [2] https://phabricator.wikimedia.org/T162525

___
Wikimedia-l mailing list, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 


[Wikimedia-l] If we had proper language models…

2019-05-19 Thread John Erling Blad
Microsoft has unveiled an idea about a grammar and style tool for
Word. [1] I proposed something similar for detecting problematic
grammatical constructs in the content translation tools. [2] That was
a couple of years ago now, and I closed the task.

[1] https://venturebeat.com/2019/05/06/microsoft-debuts-ideas-in-word-a-grammar-and-style-suggestions-tool-powered-by-ai/
[2] https://phabricator.wikimedia.org/T162525
