Automatic cleaning of messy text data

[UPDATE: I just wrapped AspellCheck() in an llply/laply wrapper from Hadley Wickham’s `plyr` package, so it can now be run on a vector of texts as well as on a single character string. It also shows a progress bar by default (set progress = "none" to turn it off). Note that `plyr` must now be loaded before you can use AspellCheck().]

I deal with a lot of text data, and in R, the basic, general-purpose suite of tools for analyzing text data is the `tm` (text mining) package. I like the tm package a lot – it provides some convenient methods for pulling data from lots of different formats into a single corpus of texts, and it uses the sparseMatrix() function from the `Matrix` package, which lets you compare very large numbers of terms across very large numbers of documents without using a correspondingly large amount of memory.