This meetup included an extensive Text Mining in R session with an Introduction to tm by Ingo Feinerer and a talk about Text Mining with Hadoop by Stefan Theussl.

After a creative break last month, Ingo and Stefan gave great talks that covered tm in greater detail, following the brief introduction in February.

Ingo Feinerer: Introduction to tm

Ingo started right away with a nice bottom-up introduction covering tm’s building blocks such as Sources, Readers and Corpora. He also motivated the creation of DocumentTermMatrices with a small clustering example for three documents.
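
To make the building blocks a bit more concrete, here is a minimal sketch of the path from a Source to a Corpus to a DocumentTermMatrix, followed by a small clustering step. The three documents are made up for illustration and are not the ones from the talk, and the choice of Euclidean distance with hclust is an assumption rather than what Ingo actually showed.

```r
library(tm)

# Three toy documents (hypothetical, not taken from the talk)
docs <- c("text mining with R",
          "text mining with Hadoop",
          "cooking pasta at home")

# Source -> Corpus: VectorSource treats each vector element as one document
corp <- VCorpus(VectorSource(docs))

# Corpus -> DocumentTermMatrix: rows are documents, columns are terms
dtm <- DocumentTermMatrix(corp)
m <- as.matrix(dtm)

# Hierarchical clustering on Euclidean distances between the term-count rows
hc <- hclust(dist(m))

# Cut the tree into two groups of documents
groups <- cutree(hc, k = 2)
print(groups)
```

With these toy documents the two text mining sentences share most of their terms, so they should land in the same group, while the cooking sentence ends up on its own.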

Links: CRAN package, package vignette

The word cloud shown above was created from the tm package vignette as follows:

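library(tm)

# Locate the vignette PDF shipped with the tm package
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# The PDF reader used here needs the external pdfinfo and pdftotext tools
stopifnot(all(file.exists(Sys.which(c("pdfinfo", "pdftotext")))))
corp <- Corpus(URISource(uri), readerControl = list(reader = readPDF))
# Collapse the text into one string, then strip numbers, punctuation, extra whitespace and stop words
tmvignette <- paste(content(corp[[1]]), collapse = "\n")
vigclean <- stripWhitespace(removePunctuation(removeNumbers(tmvignette)))
vigclean <- removeWords(vigclean, stopwords())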
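
The snippet above stops at the cleaned text, so the drawing step is not shown. The following is a rough sketch of how word frequencies could be turned into a cloud, assuming the wordcloud package from CRAN is used for plotting. That package choice, the sample_text stand-in string and the min.freq setting are assumptions for illustration, not the original code.

```r
# Stand-in for the cleaned vignette text (hypothetical sample, not the real vignette)
sample_text <- "corpus text mining corpus documents text corpus"

# Split on whitespace and count how often each word occurs
words <- strsplit(tolower(sample_text), "[[:space:]]+")[[1]]
words <- words[nzchar(words)]
freq <- sort(table(words), decreasing = TRUE)

# Draw the cloud only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.integer(freq), min.freq = 1)
}
```

Replacing sample_text with a cleaned string such as vigclean should give a cloud in the same spirit, although the layout will differ from run to run because wordcloud places words randomly by default.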

Stefan Theussl: Text Mining with Hadoop

Stefan presented a solution for when things (i.e. text corpora) get big, using a set of Hadoop-related R packages he created in collaboration with Ingo.

CRAN packages: