So far, when looking at text, we've been breaking it down into words, with varying degrees of preprocessing, and using the word as our token or term. However, there is a lot of mileage in comparing other units of text, such as the letter n-gram, which can prove effective in a variety of applications.
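To make the idea of a letter n-gram concrete, here is a minimal sketch in Python. The function names `letter_ngrams` and `ngram_overlap`, the lowercasing and whitespace normalisation, and the choice of Jaccard overlap as the comparison are all assumptions made for illustration, not something taken from the post itself.

```python
def letter_ngrams(text, n=3):
    """Return the overlapping letter n-grams of `text`, in order."""
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    normalised = " ".join(text.lower().split())
    # Slide a window of n characters across the string, one step at a time.
    return [normalised[i:i + n] for i in range(len(normalised) - n + 1)]


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the sets of letter n-grams of two strings."""
    grams_a = set(letter_ngrams(a, n))
    grams_b = set(letter_ngrams(b, n))
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than n, so there is nothing to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    print(letter_ngrams("n-gram", 3))
    print(ngram_overlap("colour", "color"))
```

Because the units are runs of characters rather than whole words, two spellings such as "colour" and "color" still share some trigrams, which a word-level comparison would miss entirely.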
A site about search, text categorisation, clustering and other interesting topics relevant to the web, but not often covered for PHP developers.