Simple Search: The Vector Space Model

17 Sep 2009

One of the issues with the boolean search model is that results are unranked: every matching document for a query contains all of the terms in that query, and there's no real way of saying which are 'better'. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results so that the best matches for the query come first. This is the idea that forms the basis for the vector space model.

Block Based External Sort

3 Sep 2009

Memory isn't something that we have to worry about very much in PHP, as memory management is handled for us by the Zend engine. However, when it does become an issue it becomes a very big one: most PHP scripts are limited in how much memory they can consume. While this makes a lot of sense for web processes, and is in general not a problem, when you have a lot of data to deal with it can make life difficult.

Tries And Wildcards

27 Aug 2009

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren't sure whether the title you're trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.

Simple Search: Phrases

21 Aug 2009

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even 'natural' searches like that, assuming that a search like php information retrieval is "look for any document containing the words php AND information AND retrieval", but sometimes the user is searching for that specific phrase in that specific order.

K-Means Clustering

14 Aug 2009

My friend Vincenzo recently posted a review of academic work on clustering that he compiled while working at the University of Naples. It's worth a look if you're interested in the field, going from the basic methods all the way up to the latest techniques like Support Vector Clustering (which I believe you can read about in Enzo's master's thesis).

