Much of the data available today is unstructured and text-heavy, making it challenging for analysts to apply their usual data wrangling and visualization tools. With this practical book, you'll explore text-mining techniques with tidytext, a package that authors Julia Silge and David Robinson developed using the tidy principles behind R packages like ggraph and dplyr. You'll learn how tidytext and other tidy tools in R can make text analysis easier and more effective. The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You'll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media.
Learn how to apply the tidy text format to NLP
Use sentiment analysis to mine the emotional content of text
Identify a document's most important terms with frequency measurements
Explore relationships and connections between words with the ggraph and widyr packages
Convert back and forth between R's tidy and non-tidy text formats
Use topic modeling to classify document collections into natural groups
Examine case studies that compare Twitter archives, dig into NASA metadata, and analyze thousands of Usenet messages
Really comprehensive book about text mining with R and tidy. While it is understood that some R and tidy knowledge is required to work through the book's examples, at around the tf-idf chapter I started to feel that I was spending more time searching Google to see what a specific R function was doing than grasping the theoretical concepts applied to the cases. That made me lose interest and want to look for other references. But I finally found the time to finish it, and I have to say that, all in all, this is a good book for seeing how to handle basic-to-medium use cases of text mining with R. I believe it may become a reference book for me when I work on my own datasets.
Great code examples! Easy to emulate, shows the necessary data cleaning and preprocessing and gives good tips for what to do in other contexts. You'll need to be already familiar with R and the dplyr package to get anything out of this book, though.
If you don't know R or dplyr and want to jump straight into natural language processing, I'd instead recommend starting with the vignettes for the tm or quanteda packages.
It covers the basics (sentiment analysis, tf-idf, n-grams, topic modelling, and visualization) well, and the case-study chapters are pretty helpful. The use of literature (Jane Austen's novels and more) as data also makes it more engaging to a literary-minded reader.
It's just that when the author says "slightly familiar with dplyr and ggplot2" in the preface, she means she is not going to explain any code relating to these two packages. Compared to the code annotated line by line in other online tutorials, this book may not be that accessible to a beginner.
On topic modelling, you may want to google how to determine the number of topics as more systematic approaches to such determination are not covered.
Disclaimer: I am not an expert on text mining, but I do have ~8 years of data science experience.
This was a very nice introduction to doing it in R, and the examples were very interesting too. In general, I recommend books by these authors.
My only complaint is that they did not go into detail about how they scraped Twitter posts. The API is quite annoying and limited, so one might have to do some regular web scraping. Guess I should read a textbook on that next.
I enjoyed working through the book, but it is a bit dated at this point and has some areas that no longer work due to outdated packages. At times I had to go to the website and review what they had updated there. There are also times where they don't have the code set up so that a reader can actually execute it. For example, the code related to the Twitter files was a bit confusing. I had to go to the GitHub repository to actually download the data, and this should have been explained, since the book really is a mixture of coding and commentary.
Although this book is no longer the most up-to-date book on text mining, I really like a lot of the plots (ggplots) in it, in particular for exploratory analysis. You can inspect the outputs at each stage, and visuals are a great way to make sense of the text data and communicate your findings. I will definitely reuse the plots in this book for further work.
The examples are interesting and very easy to follow. If you have any problem applying the techniques to your data set, just a quick search would lead you to the solutions!
Excellent coverage of taking a tidy approach to text analysis, with a generous number of worked examples. The one drawback is that much of the code used requires at least an intermediate-level working knowledge of R.
Good overview of the tidytext library in R. Not the end-all of text analyses, but a good place to begin. I now need to get something to do some analyses on...
Awesome book - with great step-by-step code to follow. The authors clearly explain analytical questions and walk through their analysis. It's so good I read it twice.