PHP/ir

Information Retrieval and other interesting topics

PageRank In PHP

In: search

12 Dec 2009

Google was a better search engine than its predecessors for a number of reasons, but probably the best known is PageRank, the algorithm for measuring the importance of a page based on what links to it. Though not necessarily that useful on its own, this kind of link analysis can be very helpful as part of a general information retrieval system, or when looking at any kind of network, such as a friend graph from a social network.
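
To give a flavour of the idea, here is a minimal power-iteration sketch of PageRank. It is not the code from the full post: the function name pageRank, the adjacency-list input format, the damping factor of 0.85 and the fixed iteration count are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: $links maps each page to the list of pages it links to.
function pageRank(array $links, float $damping = 0.85, int $iterations = 50): array
{
    // Collect every page, including ones that only appear as link targets.
    $pages = array_keys($links);
    foreach ($links as $targets) {
        foreach ($targets as $target) {
            $pages[] = $target;
        }
    }
    $pages = array_values(array_unique($pages));
    $n = count($pages);
    if ($n === 0) {
        return [];
    }

    // Start with rank spread evenly over all pages.
    $rank = array_fill_keys($pages, 1.0 / $n);

    for ($i = 0; $i < $iterations; $i++) {
        // Every page gets a small baseline share (the "random jump").
        $next = array_fill_keys($pages, (1.0 - $damping) / $n);
        foreach ($pages as $page) {
            $targets = array_unique($links[$page] ?? []);
            if (count($targets) === 0) {
                // Assumption: a page with no outbound links shares its rank with everyone.
                $targets = $pages;
            }
            $share = $damping * $rank[$page] / count($targets);
            foreach ($targets as $target) {
                $next[$target] += $share;
            }
        }
        $rank = $next;
    }
    return $rank;
}

// Example: 'b' has no inbound links, so it ends up with the lowest score.
$ranks = pageRank(['a' => ['c'], 'b' => ['c'], 'c' => ['a']]);
```

The scores always sum to 1, so they can be read as the share of attention each page receives from the pages that link to it.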

Text Generation

In: language

12 Dec 2009

After a rather technical post last week, something a bit lighter. Text and language generation is a fun topic with applications that run from randomly generating scientific papers for conferences, to the practical tasks of generating speech and automated responses. In this post we'll look at how we can generate some nonsense text based on existing documents, which isn't especially practical, though it can make a fun change from Lorem Ipsum for placeholder copy. The code is included throughout the post, but you can also grab the lot in a zip.

Support Vector Machines In PHP

In: classification, svm, vector space

12 Dec 2009

When it comes to classification, and machine learning in general, at the head of the pack there's often a Support Vector Machine based method. In this post we'll look at what SVMs do and how they work, and as usual there's some example code. However, even a simple PHP-only SVM implementation is a little bit long, so this time the complete source is available separately in a zip file.

Part Of Speech Tagging

In: language

11 Nov 2009

Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be hugely beneficial, and the first step in such parsing is often part-of-speech tagging.

Language Detection With N-Grams

In: vector space, language

11 Nov 2009

So far when we've been looking at text we've been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.
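
As a quick illustration of what a letter n-gram is, the sketch below slides a window of n characters across a string and counts each distinct window. The function name letterNgrams and the default of n = 3 are assumptions for this example rather than the code from the full post, and it works on bytes, so it assumes single-byte text.

```php
<?php
// Hypothetical helper: count the overlapping letter n-grams in a string.
// Works byte by byte, so it assumes single-byte (e.g. ASCII) input.
function letterNgrams(string $text, int $n = 3): array
{
    $text = strtolower($text);
    $counts = [];
    $last = strlen($text) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $gram = substr($text, $i, $n);
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

// "banana" with n = 2 yields the bigrams ba, an, na, an, na.
$bigrams = letterNgrams('banana', 2);
```

Comparing the most frequent n-grams of an unknown text against profiles built from known languages is one way such counts can be put to use.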