In one of his talks at QCon, John Allspaw mentioned using Holt-Winter exponential smoothing on various monitoring instances. Wikipedia has a good entry on the subject, of course, but the basic idea is to take a noisy/spikey time series and smooth it out, so that unexpected changes will stand out even more. That's often initially done by taking a moving average, so say averaging the last 7 days of data and using that as the current day's value. More complicated schemes weight that average, so that the older data contributes less.
A lot of interesting techniques involve taking statistical samples, and using those to predict what we'll see in the future. Usually this works pretty well, but when we're dealing with a lot of options or if we have some options that are very rare that approach can go pretty wrong. If we go down the street and note down how many men and women we see, we'll probably be able to use that to predict the chance of the next person we see being male or female pretty well. However, if we were counting all the species of animals we encounter, and trying to use that to predict what we'll see in the future, we'd likely run in to a couple of problems.
In the last post we had a simple stepping algorithm, and a gradient descent implementation, for fitting a line to a set of points with one variable and one 'outcome'. As I mentioned though, it's fairly straightforward to extend that to multiple variables, and even to curves, rather than just straight lines.
I've had a couple of emails recently about the excellent Stanford Machine Learning and AI online classes, so I thought I'd put up the odd post or two on some of the techniques they cover, and what they might look like in PHP.
Benfords Law is not an exciting new John Nettles based detective show, but an interesting observation about the distribution of the first digit in sets of numbers originating from various processes. It says, roughly, that in a big collection of data you should expect to see a number starting with 1 about 30% of the time, but starting with 9 only about 5% of the time. Precisely, the proportion for a given digit can be worked out as: