One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there's no real way of saying which are 'better'. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results so that the best matches for the query come first. This is the idea that forms the basis for the vector space model.
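As a rough illustration of weighting terms by how representative they are, here is a minimal sketch using tf-idf weights and a plain summed score. The corpus, the function names `tfidfWeights` and `rank`, and the scoring shortcut are all assumptions made for the example; a fuller vector space implementation would normally compare normalised vectors (for example with cosine similarity) rather than summing weights.

```php
<?php
// Hypothetical toy corpus: document id => list of already-tokenised terms.
$docs = [
    'd1' => ['php', 'search', 'search', 'engine'],
    'd2' => ['php', 'memory', 'limit'],
    'd3' => ['search', 'ranking', 'vector', 'space'],
];

// Weight each term in each document by tf * log(N / df).
function tfidfWeights(array $docs): array
{
    $n = count($docs);

    // Document frequency: in how many documents does each term occur?
    $df = [];
    foreach ($docs as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $weights = [];
    foreach ($docs as $id => $tokens) {
        $weights[$id] = [];
        foreach (array_count_values($tokens) as $term => $tf) {
            $weights[$id][$term] = $tf * log($n / $df[$term]);
        }
    }
    return $weights;
}

// Score each document as the sum of its weights for the query terms,
// then sort so the highest score comes first.
function rank(array $weights, array $queryTerms): array
{
    $scores = [];
    foreach ($weights as $id => $terms) {
        $score = 0.0;
        foreach ($queryTerms as $term) {
            $score += $terms[$term] ?? 0.0;
        }
        if ($score > 0) {
            $scores[$id] = $score;
        }
    }
    arsort($scores);
    return $scores;
}

$ranked = rank(tfidfWeights($docs), ['php', 'search']);
```

In this toy corpus the document that mentions both query terms, and mentions search twice, ends up at the top of `$ranked`, which is exactly the ordering a boolean match cannot give us.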
Memory isn't something that we have to worry about very much in PHP, as memory management is handled for us by the Zend engine. However, when it does become an issue it becomes a very big one - most PHP scripts are limited as to how much memory they can consume. While this makes a lot of sense for web processes, and is in general not a problem, when you have a lot of data to deal with it can make life difficult.
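To see how close a script is to its limit, one option is to compare `memory_get_usage()` against the `memory_limit` ini setting. The sketch below assumes the setting uses the usual shorthand (such as 128M) or -1 for unlimited; the helper name `bytesFromIni` is made up for this example.

```php
<?php
// Convert an ini shorthand value such as "128M" into bytes.
// Returns -1 when the setting means "no limit".
function bytesFromIni(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1;
    }

    $unit = strtolower(substr($value, -1));
    $number = (int) $value; // the cast keeps the leading digits only

    switch ($unit) {
        case 'g':
            return $number * 1024 * 1024 * 1024;
        case 'm':
            return $number * 1024 * 1024;
        case 'k':
            return $number * 1024;
        default:
            return $number;
    }
}

$limit = bytesFromIni((string) ini_get('memory_limit'));
$used = memory_get_usage();

if ($limit === -1) {
    echo "No memory limit set, currently using {$used} bytes\n";
} else {
    printf("Using %d of %d bytes (%.1f%%)\n", $used, $limit, 100 * $used / $limit);
}
```

Checking this periodically inside a long-running loop gives some warning before the script is killed, which is usually when large data sets start to hurt.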
One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren't sure whether the title you're trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.
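A minimal sketch of the simplest case, a trailing wildcard expanded against a list of known terms, might look like the following. The function name `expandWildcard` and the vocabulary are hypothetical, and a linear scan like this is only reasonable for small vocabularies; it does not handle wildcards at the start or in the middle of a word.

```php
<?php
// Expand a trailing-wildcard pattern such as "academ*" into every
// vocabulary term that starts with the given prefix.
function expandWildcard(string $pattern, array $vocabulary): array
{
    $prefix = rtrim($pattern, '*');
    $length = strlen($prefix);

    $matches = [];
    foreach ($vocabulary as $term) {
        if (strncmp($term, $prefix, $length) === 0) {
            $matches[] = $term;
        }
    }
    return $matches;
}

// Hypothetical vocabulary taken from an index.
$vocabulary = ['academy', 'academic', 'acid', 'academics', 'school'];
$terms = expandWildcard('academ*', $vocabulary);
```

The expanded terms can then be combined with OR in an ordinary boolean query.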
In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even 'natural' searches like that, assuming that a search like php information retrieval is "look for any document containing the words php AND information AND retrieval", but sometimes the user is searching for that specific phrase in that specific order.
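One common way to support phrases is to store the position of every term as well as the documents it appears in, then check that the query terms occur at consecutive positions. The sketch below assumes a very simple tokeniser and uses made-up function names (`positionalIndex`, `phraseSearch`) and sample documents; it is not necessarily how the earlier post's system was built.

```php
<?php
// Build a positional index: term => document id => list of positions.
function positionalIndex(array $docs): array
{
    $index = [];
    foreach ($docs as $id => $text) {
        $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $position => $term) {
            $index[$term][$id][] = $position;
        }
    }
    return $index;
}

// Return the ids of documents containing the phrase's terms
// at consecutive positions, in order.
function phraseSearch(array $index, string $phrase): array
{
    $terms = preg_split('/\W+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY);
    if ($terms === []) {
        return [];
    }

    $first = array_shift($terms); // remaining terms are re-indexed from 0
    $results = [];
    foreach ($index[$first] ?? [] as $id => $positions) {
        foreach ($positions as $start) {
            $matched = true;
            foreach ($terms as $offset => $term) {
                $expected = $start + $offset + 1;
                if (!in_array($expected, $index[$term][$id] ?? [], true)) {
                    $matched = false;
                    break;
                }
            }
            if ($matched) {
                $results[] = $id;
                break; // one occurrence is enough for this document
            }
        }
    }
    return $results;
}

// Hypothetical sample documents.
$index = positionalIndex([
    'a' => 'PHP information retrieval basics',
    'b' => 'Information about PHP retrieval',
]);
$hits = phraseSearch($index, 'php information retrieval');
```

Both documents contain all three words, so a boolean AND would return both, but only the first has them in the requested order.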
My friend Vincenzo recently posted a review of academic work on clustering that he compiled while working at the University of Naples. It's worth a look if you're interested in the field, going from the basic methods all the way up to the latest techniques like Support Vector Clustering (which I believe you can read about in Enzo's master's thesis).
A site about search, text categorisation, clustering and other interesting topics relevant to the web, but not often covered for PHP developers.