I've been coding my own text predictors and researching them hard, and I've noticed
the key is the frequency of what usually comes next after your context. E.g. after
the context [th], you see that [e] has a high observed count. But you want the
past window to be very long and pull your continuation frequencies from that.
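
Here's a minimal sketch of that counting step, assuming a character-level model
(the corpus string and window size are just made up for illustration):

```python
from collections import Counter, defaultdict

def build_counts(text, window):
    """Map each length-`window` context to a Counter of next characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - window):
        counts[text[i:i + window]][text[i + window]] += 1
    return counts

def predict(counts, context):
    """Return candidate next characters, highest count first."""
    return counts[context].most_common()

corpus = "the theory then thereby thinned the thesis"
counts = build_counts(corpus, window=2)
print(predict(counts, "th"))  # [('e', 6), ('i', 1)] -- 'e' dominates
```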
The long window is what lets it recognize very long, unseen contexts: use
something like word2vec to find context matches that aren't exact, and focus on
the important words, which tend to be the rarer ones. The context can then
stretch even longer and become more accurate, because every soft match boosts
its candidate predictions. A sketch of that is below.
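
Here's roughly how I picture that fuzzy matching working. The random vectors
are stand-ins for real word2vec embeddings, and the 1/log rarity weighting is
just one choice I'm assuming, not a fixed recipe:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
word_vectors = {}   # word -> stand-in embedding (real word2vec goes here)
word_freq = {}      # word -> corpus frequency, used for rarity weighting

def vec(word):
    if word not in word_vectors:
        word_vectors[word] = rng.normal(size=16)
    return word_vectors[word]

def embed(words):
    """Rarity-weighted average of word vectors: rarer words count for more."""
    weights = [1.0 / math.log(2 + word_freq.get(w, 0)) for w in words]
    v = sum(w * vec(word) for w, word in zip(weights, words))
    return v / (np.linalg.norm(v) + 1e-9)

def score_candidates(unseen_context, stored):
    """stored: list of (context_words, next_word, count) tuples.
    Each stored context votes for its next word, scaled by how similar
    it is to the unseen context -- a soft version of exact matching."""
    q = embed(unseen_context)
    scores = {}
    for ctx, nxt, count in stored:
        sim = float(q @ embed(ctx))  # cosine similarity (unit vectors)
        scores[nxt] = scores.get(nxt, 0.0) + sim * count
    return sorted(scores.items(), key=lambda kv: -kv[1])

stored = [("the quick brown".split(), "fox", 3),
          ("a slow brown".split(), "dog", 2)]
print(score_candidates("one quick brown".split(), stored))
```

With real word2vec vectors, contexts that share meaning rather than exact words
would also score high, which is what lets the window stretch past anything seen
verbatim.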