PHP/ir

Information Retrieval and other interesting topics

Tries And Wildcards

In: datastructures

09 Sep 2009

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren't sure whether the title you're trying to remember contains the word academy, academic, academically, or academics, you might be well served by trying all four at once: academ*.
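As a rough illustration of how a trie can answer a trailing-wildcard query like academ*, here is a minimal Python sketch. The class and method names (TrieNode, Trie, with_prefix) are made up for this example, and it only handles a wildcard at the end of the term.

```python
class TrieNode:
    """One node per character; is_word marks the end of an indexed term."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return every indexed term that starts with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # nothing in the index shares this prefix
        # Walk the subtree below the prefix, collecting complete terms.
        matches = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                matches.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(matches)


def expand_wildcard(trie, pattern):
    """Expand a trailing-star pattern such as 'academ*' into index terms."""
    if pattern.endswith("*"):
        return trie.with_prefix(pattern[:-1])
    # No wildcard: the pattern only matches itself, if it is indexed.
    return [pattern] if pattern in trie.with_prefix(pattern) else []


trie = Trie()
for term in ["academy", "academic", "academically", "academics", "acorn"]:
    trie.insert(term)

print(expand_wildcard(trie, "academ*"))
```

The expanded terms can then be OR'd together and fed to the normal boolean search.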

Simple Search: Phrases

In: search

09 Sep 2009

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even 'natural' searches that way, assuming that a search like php information retrieval means "look for any document containing the words php AND information AND retrieval", but sometimes the user is searching for that specific phrase in that specific order.
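One common way to support phrase queries is a positional index, which records where each term occurs in each document. The Python sketch below assumes that approach and uses a naive whitespace tokeniser; build_positional_index and phrase_search are illustrative names, not taken from the earlier post.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the terms adjacent and in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Boolean AND first: only documents holding every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    results = []
    for doc_id in candidates:
        # A match starts wherever term i sits exactly i places after the first.
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                results.append(doc_id)
                break
    return sorted(results)


docs = {
    1: "php information retrieval",
    2: "information about php retrieval",
    3: "retrieval of php information",
}
index = build_positional_index(docs)
print(phrase_search(index, "php information retrieval"))
```

All three documents would satisfy the AND query, but only the first contains the words in that exact order.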

Simple Search: The Vector Space Model

In: search

09 Sep 2009

One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there's no real way of saying which are 'better'. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results by the ones that were the best match for the query. This is the idea that forms the basis for the vector space model.
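To make the ranking idea concrete, here is a small Python sketch that weights terms with tf-idf and orders documents by cosine similarity to the query. Both choices are assumptions for the example, since the excerpt doesn't say which weighting the post uses, and the function names are invented here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (doc id -> {term: weight}, term -> idf) using tf * log(N / df)."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_count = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(doc_count / df) for term, df in doc_freq.items()}
    vectors = {}
    for doc_id, tokens in tokenised.items():
        term_freq = Counter(tokens)
        vectors[doc_id] = {t: f * idf[t] for t, f in term_freq.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(docs, query):
    """Return ids of documents with a non-zero score, best match first."""
    vectors, idf = tfidf_vectors(docs)
    query_freq = Counter(query.lower().split())
    query_vec = {t: f * idf[t] for t, f in query_freq.items() if t in idf}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


docs = {
    "a": "php search php index",
    "b": "python search index",
    "c": "cooking recipes",
}
print(rank(docs, "search"))
```

Both "a" and "b" contain search, but "b" ranks first because the term makes up a larger share of its vector.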

K-Means Clustering

In: clustering

09 Sep 2009

My friend Vincenzo recently posted a review of academic work on clustering that he compiled while working at the University of Naples. It's worth a look if you're interested in the field, going from the basic methods all the way up to the latest techniques like Support Vector Clustering (which I believe you can read about in Enzo's master's thesis).

Tokenisation

In: special, interest

09 Sep 2009

Taking a string and separating it into tokens is one of those smaller problems in search that seems simple at first (just split on spaces) but can quickly become overwhelmed with edge cases. Ignoring the problem of other languages, some of which don't necessarily use spaces at all, the exceptions tend to fall into two categories: punctuation and normalisation.
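A short Python sketch shows both categories at work. The specific rules here (NFKC normalisation, lowercasing, keeping apostrophes and hyphens inside words) are assumptions picked for illustration rather than the rules the post settles on.

```python
import re
import unicodedata

# A token is a run of letters or digits, optionally joined by single
# internal apostrophes or hyphens, so "don't" and "edge-case" stay whole.
TOKEN_PATTERN = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def tokenise(text):
    """Normalise the text, then pull out word-like tokens."""
    # Normalisation: fold compatibility characters and case together.
    text = unicodedata.normalize("NFKC", text).lower()
    # Punctuation: anything outside the token pattern is dropped.
    return TOKEN_PATTERN.findall(text)


sentence = "Don't split on spaces - it's an edge-case, OK?"
print(sentence.split())    # naive split keeps punctuation glued to words
print(tokenise(sentence))
```

With the naive split, "edge-case," and "OK?" would never match a query for edge-case or ok; after tokenising they do.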