A lot of interesting techniques involve taking statistical samples, and using those to predict what we'll see in the future. Usually this works pretty well, but when we're dealing with a lot of options or if we have some options that are very rare that approach can go pretty wrong. If we go down the street and note down how many men and women we see, we'll probably be able to use that to predict the chance of the next person we see being male or female pretty well. However, if we were counting all the species of animals we encounter, and trying to use that to predict what we'll see in the future, we'd likely run in to a couple of problems.
One, we wouldn't have any way of reserving any of our expectation for seeing new animals, but it is totally possible that we will encounter an animal we hadn't seen in our first sample as there are a lot of different types. Two, we'll probably over-estimate the probability of seeing a more unusual type of animals if we did happen to encounter one in our first sample. The same problems exist if we try and calculate the probability of seeing different words we have found from a certain set of documents, which can be a problem in search and machine learning.
Alan Turing and IJ Good came up with some useful methods for estimating these kind of rare and unseen events. This includes a measure for how much of the population is covered by the sample: 1 - (n1 / N). N being the number of observations in the sample, and n1 is the number of types that have only been seen once.
More usefully for us, we can calculate how often we should expect to see new types. Concretely, lets look at the words in some documents. We'll write a small function that tokenises a document and counts up all the times each word appears:
We can then calculate the proportion of new words we should expect in another document, as the number of words that appeared once divided by the total number of different words.
<?php $proportion = rwords(1, $words1) / count($words1); We can test this with a couple of handy textfiles from textfiles.com:
With those two files we overestimate some, but come up with a pretty decent guess for how many we see in the next file:
As part of calculating that, we referenced a function rwords. The rwords function just counts up the number of words that have been seen 'r' times ( in this case r = 1). If we say that the number of times we have seen a word is the frequency of that word on the collection of documents, then this is the frequency of the frequency.
This gets us a useful estimate for much of our probability we should reserve for potentially unknown words. That given, we also need to reduce our other probabilities slightly, to be slightly less certain of seeing each word that we have seen again - but how do we distribute that? Good-Turing gives us another straightforward method for calculating, with the addition of some helpful smoothing from Bill Gale at AT&T.
The smoothing is needed because we calculate the expected proportions based on the difference between the number of items with the same frequency as the word (nr), and the number with the next frequency up (nr+1). When those frequencies are high (for common words for example), there is a good chance of there not being any words which were seen r + 1 times. So, we use the Z value, which calculates a score based on the next available highest and lowest.
This function is a bit complex, so I'll break it down a little. First we turn our array into a sequential, numerically indexed one with array_values, so we know we'll definitely have values at index+1 and index-1, unless we hit the ends of the array. Next we find the index of our value of r - that is, a point in the array that matches the supplied frequency. If we can't find one, we track down to a real value.
After that we get the nr score with our rwords function. We then get the r values above (rh) and below (rl) the supplied one. If we're at the top of the array we use 2r - rl, and if at the bottom we use 0. Remember that nr is the number of words that were seen r times, but rl and rh will be r values themselves - the rarer the words the higher nr is likely to be (as there are usually many different unusual words), but the lower r will be, and vice versa.
Finally, we calculate the Z value itself: 2 nr / rh - rl. We calculate the proportion using this z value like: ( (r +1)*z(r+1)/z(r) ) / N:
This overall gives us a probability of seeing a word again which is scaled to take account of potentially new words: