At work, I make language models. This is a little explanation of what a language model is.
"All models are wrong, but some are useful."
This is usually attributed to the statistician George Box, and it's a good thing to keep in mind.
Imagine you have a language and you want to be able to predict which words are likely to be strung together. You need to create a model, which is just a structured representation of how the language is used. This model will never be perfect, because it will always be incomplete and the way people use language is constantly evolving. Nevertheless, the idea is that if you get enough data (texts, transcripts, etc.) and you represent it intelligently, you'll be able to make a useful model.
An ngram model is a model where you look at a window of n words. The most basic model would simply count ngrams. Let's say we want to use a bigram (a 2gram, where the window is 2 words).
We will train the model on this very, very short input text:
The quick fox jumped over the candlestick. The fox then melted down the candlestick and jumped over the moon.
If we just iterate through the text using a window of two, lowercasing every word and treating each period as its own token, we get the following bigrams:
the quick
quick fox
fox jumped
jumped over
over the
the candlestick
candlestick .
. the
the fox
fox then
then melted
melted down
down the
the candlestick
candlestick and
and jumped
jumped over
over the
the moon
moon .
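Here is a minimal Python sketch of that sliding window, in case it helps to see it as code. The names `tokenize` and `bigrams` are my own for illustration, and the tokenizer is a deliberately crude assumption (lowercase everything, keep only letters and periods), not how a production system would do it.

```python
import re

TEXT = ("The quick fox jumped over the candlestick. "
        "The fox then melted down the candlestick and jumped over the moon.")

def tokenize(text):
    # Lowercase the text, then pull out runs of letters and standalone periods.
    return re.findall(r"[a-z]+|\.", text.lower())

def bigrams(tokens):
    # Pair every token with the token that immediately follows it.
    return list(zip(tokens, tokens[1:]))

tokens = tokenize(TEXT)
pairs = bigrams(tokens)

for first, second in pairs:
    print(first, second)
```

Run on the input text above, this prints one bigram per line in the same order as the list.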
You'll notice some of them occur more than once. So we'll add counts to them like this:
1 the quick
1 quick fox
1 fox jumped
2 jumped over
2 over the
2 the candlestick
1 candlestick .
1 . the
1 the fox
1 fox then
1 then melted
1 melted down
1 down the
1 candlestick and
1 and jumped
1 the moon
1 moon .
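The counting step can be sketched in a few lines of Python using `collections.Counter` from the standard library. This is just an illustration under the same crude tokenizing assumption as the list above (lowercase words, periods as separate tokens), and the variable names are hypothetical.

```python
import re
from collections import Counter

text = ("The quick fox jumped over the candlestick. "
        "The fox then melted down the candlestick and jumped over the moon.")

# Lowercase words, with each period kept as its own token.
tokens = re.findall(r"[a-z]+|\.", text.lower())

# Count how many times each adjacent pair of tokens occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

for (first, second), count in bigram_counts.items():
    print(count, first, second)
```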
This is a language model, albeit not a good one because it's not based on enough data.
Now, imagine you're looking over someone's shoulder as she types. If she types "the", based on this model you'd predict that the next word would be "quick", "candlestick", "fox", or "moon." Because "the candlestick" occurred twice while the other combinations only occurred once, you'd probably guess "candlestick" if you had to pick just one. Based on this model, and the much more accurate model in your brain, you wouldn't predict "the and" or "the then" or "the the."
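That guessing game can be sketched in Python too. This is only a toy: `followers` and `predict_next` are names I made up for the example, and picking the single most frequent follower is the simplest possible strategy, assumed here for illustration.

```python
import re
from collections import Counter, defaultdict

text = ("The quick fox jumped over the candlestick. "
        "The fox then melted down the candlestick and jumped over the moon.")

# Lowercase words, with each period kept as its own token.
tokens = re.findall(r"[a-z]+|\.", text.lower())

# For each word, count which words were seen immediately after it.
followers = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    followers[first][second] += 1

def predict_next(word):
    # Return the most frequent follower, or None if the word was never seen.
    counts = followers.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(sorted(followers["the"]))  # every word that ever followed "the"
print(predict_next("the"))       # the single best guess
```

A word the model never saw after "the", such as "and", simply has a count of zero, which is why the model would never guess it.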
Now you know what an ngram is. The Google Books Ngram Viewer is a fun site where you can see the frequency over time of different ngrams in models created from books.
For example:
trigram "thou art my"
trigram "never say never"
Finally I have something concrete to explain what you do to friends...Mom