PHP/ir

Information Retrieval and other interesting topics

Text Generation

In: language

12 Dec 2009

After a rather technical post last week, something a bit lighter. Text and language generation is a fun topic with applications that run from randomly generating scientific papers for conferences, to the practical tasks of generating speech and automated responses. In this post we'll look at how we can generate some nonsense text based on existing documents, which isn't on the overly practical side, though it can make a fun change from Lorem Ipsum for holding copy. The code is throughout, but you can also grab the lot in a zip.

Part Of Speech Tagging

In: language

11 Nov 2009

Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be huge beneficial, and the first step in such parsing is often part of speech tagging.

Language Detection With N-Grams

In: vector space, language

11 Nov 2009

So far when we've been looking at text we've been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.