A lot of interesting techniques involve taking statistical samples and using them to predict what we'll see in the future. Usually this works pretty well, but when we're dealing with a lot of options, or when some options are very rare, that approach can go badly wrong. If we walk down the street and note how many men and women we see, we'll probably be able to use that to predict the chance of the next person being male or female pretty well. However, if we were counting all the species of animals we … Read More »
Peter Norvig's spelling corrector is an interesting example of applying statistical techniques to the very practical problem of spelling correction, inspired by a conversation about Google's 'Did You Mean' spelling suggestions. There's an excellent explanation of the background in his article, so I'll just skim over the ideas and show how you might implement them in PHP. Read More »
Determining whether two documents are exactly the same is pretty easy: just use a suitably sized hash and look for a match. A document will generally only produce the same hash as another if the two are identical, though - the smallest change, or the same content on another site with a different header and footer, will make the hash quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one relatively cheap process for detecting them. Read More »
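As a rough illustration of the idea, here is a minimal sketch of word-level shingling, assuming three-word shingles compared with Jaccard resemblance. The function names (`words`, `shingles`, `resemblance`) and the sample strings are made up for this example, and a real implementation would usually hash and sample the shingles rather than keep them all.

```php
<?php
// Hypothetical helper: lowercase the text and split it into words on whitespace.
function words(string $text): array
{
    return preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build the set of overlapping $w-word shingles for a document.
function shingles(string $text, int $w = 3): array
{
    $tokens = words($text);
    $set = [];
    for ($i = 0; $i + $w <= count($tokens); $i++) {
        $shingle = implode(' ', array_slice($tokens, $i, $w));
        $set[$shingle] = true; // array keys give us set semantics
    }
    return $set;
}

// Jaccard resemblance: size of the intersection over size of the union.
function resemblance(array $a, array $b): float
{
    $union = count($a + $b); // array union on keys
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect_key($a, $b)) / $union;
}

$original = 'the quick brown fox jumps over the lazy dog';
$wrapped  = 'site header the quick brown fox jumps over the lazy dog';

// The exact hashes differ even though most of the content is shared...
var_dump(md5($original) === md5($wrapped));

// ...while the shingle sets still overlap heavily (7 of 9 distinct shingles).
echo resemblance(shingles($original), shingles($wrapped)), "\n";
```

A resemblance close to 1 suggests a near duplicate; where to put the threshold is a judgment call that depends on the collection.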
Taking a string and separating it into tokens is one of those smaller problems in search that initially seems simple - split on spaces - but can quickly become overwhelmed with edge cases. Ignoring the problem of other languages, some of which don't necessarily use spaces at all, the exceptions tend to fall into two categories: punctuation and normalisation. Read More »
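To make the two categories concrete, here is a small sketch of a tokenizer, assuming whitespace-separated text. The `tokenize` function and its rules are illustrative choices only: it lowercases (a simple normalisation, ASCII-only with `strtolower`), splits on whitespace, and strips punctuation from the edges of each token while leaving internal punctuation alone.

```php
<?php
// Hypothetical tokenizer: one possible set of rules, not a complete solution.
function tokenize(string $text): array
{
    // Normalisation: fold case (strtolower only handles ASCII letters).
    $text = strtolower($text);

    // Naive first pass: split on runs of whitespace.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $out = [];
    foreach ($tokens as $token) {
        // Punctuation: strip anything that is not a letter or digit from
        // both ends, keeping internal characters such as ' and -.
        $token = preg_replace('/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/u', '', $token);
        if ($token !== '') {
            $out[] = $token;
        }
    }
    return $out;
}

print_r(tokenize("Don't split e-mail, or U.S.A. badly!"));
// Tokens: don't, split, e-mail, or, u.s.a, badly
```

Even this small example shows the edge cases creeping in: "U.S.A." loses its final full stop but keeps the internal ones, and whether that is the right call depends on how the tokens will be used.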
A site about search, text categorisation, clustering and other interesting topics relevant to the web, but not often covered for PHP developers.