This meetup included an extensive Text Mining in R session with an Introduction to tm by Ingo Feinerer and a talk about Text Mining with Hadoop by Stefan Theussl.
After a creative break last month, Ingo and Stefan gave great talks that covered tm in greater detail, following the brief introduction in February.
Ingo Feinerer: Introduction to tm
Ingo started right away with a nice bottom-up introduction covering tm’s building blocks such as Sources, Readers and Corpora. He also motivated the creation of document-term matrices with a small clustering example for three documents.
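To make the building blocks concrete, here is a minimal sketch of the path from a source to a corpus to a document-term matrix, followed by a clustering of three documents. This is not Ingo's example from the talk: the three toy documents, the cleaning steps and the choice of hierarchical clustering on Euclidean distances are assumptions made purely for illustration.

```r
library(tm)

# Three made-up toy documents (not the ones used in the talk)
docs <- c(
  "Text mining in R with the tm package",
  "Hadoop helps when text corpora get big",
  "The tm package builds corpora from sources and readers"
)

# Source -> Corpus: VectorSource treats each vector element as one document
corp <- VCorpus(VectorSource(docs))

# Basic cleaning: lower-case the text, then drop English stop words
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))

# Rows are documents, columns are terms, cells are term frequencies
dtm <- DocumentTermMatrix(corp)

# Hierarchical clustering on Euclidean distances between document rows
# (an assumed choice; any distance/clustering method could be used here)
hc <- hclust(dist(as.matrix(dtm)))
```

Calling plot(hc) draws the resulting dendrogram, and inspect(dtm) shows the underlying term counts.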
The word cloud shown above was created from the tm package vignette as follows:
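library(tm)
library(wordcloud)
# Locate the tm vignette PDF that ships with the package
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# readPDF relies on the external tools pdfinfo and pdftotext
stopifnot(all(file.exists(Sys.which(c("pdfinfo", "pdftotext")))))
corp <- Corpus(URISource(uri), readerControl = list(reader = readPDF))
# Collapse the text of the first (and only) document into a single string
tmvignette <- paste(content(corp[[1]]), collapse = "\n")
# Strip numbers, punctuation, extra whitespace and stop words
vigclean <- stripWhitespace(removePunctuation(removeNumbers(tmvignette)))
vigclean <- removeWords(vigclean, stopwords())
wordcloud(vigclean)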
Stefan Theussl: Text Mining with Hadoop
Stefan presented a solution for when things (i.e. text corpora) get big, using a set of Hadoop R packages he created in collaboration with Ingo.
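Hadoop is built around the map/reduce pattern: work on independent chunks of data first, then merge the partial results. The following is a rough sketch of that idea in plain base R, counting terms per chunk and summing the counts afterwards. It does not use Stefan's packages or their API; the toy chunks and the helper merge_counts are hypothetical and exist only for this illustration.

```r
# Toy "chunks" standing in for parts of a large corpus (made-up data)
chunks <- list(
  c("text mining in r", "mining text with tm"),
  c("hadoop for big text corpora")
)

# Map step: each chunk is counted on its own, so in a real cluster
# every chunk could be handled by a different worker
map_counts <- lapply(chunks, function(chunk) {
  table(unlist(strsplit(chunk, " ", fixed = TRUE)))
})

# Reduce step: merge two sets of counts by summing counts per term
# (merge_counts is a hypothetical helper, not part of any package)
merge_counts <- function(a, b) {
  terms <- union(names(a), names(b))
  out <- setNames(numeric(length(terms)), terms)
  out[names(a)] <- out[names(a)] + as.numeric(a)
  out[names(b)] <- out[names(b)] + as.numeric(b)
  out
}

# Fold all per-chunk counts into one named vector of term frequencies
total <- Reduce(merge_counts, map_counts)
```

Because the map step never looks across chunks, the same logic scales out once the chunks live on different machines, which is the part a framework like Hadoop takes care of.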