Finding Data Anomalies You Didn't Know to Look For
Anomaly detection is the detective work of machine learning: finding the unusual, catching the fraud, discovering strange activity in large and complex datasets. But, unlike Sherlock Holmes, you may not know what the puzzle is, much less what "suspects" you're looking for. This O'Reilly report uses practical examples to explain how the underlying concepts of anomaly detection work.
From banking security to natural sciences, medicine, and marketing, anomaly detection has many useful applications in this age of big data. And the search for anomalies will intensify once the Internet of Things spawns even more new types of data. The concepts described in this report will help you tackle anomaly detection in your own project.
- Use probabilistic models to predict what's normal and contrast that to what you observe
- Set an adaptive threshold to determine which data falls outside of the normal range, using the t-digest algorithm
- Establish normal fluctuations in complex systems and signals (such as an EKG) with a more adaptive probabilistic model
- Use historical data to discover anomalies in sporadic event streams, such as web traffic
- Learn how to use deviations in expected behavior to trigger fraud alerts
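The adaptive-threshold idea in the list above can be illustrated with a minimal sketch. This is not code from the report: the class name `QuantileThreshold` and its parameters are hypothetical, and an exact sorted sliding window stands in for the t-digest, which would approximate the same quantile in bounded memory.

```python
from bisect import bisect_left, insort
from collections import deque


class QuantileThreshold:
    """Flag values above a high quantile of the most recent observations.

    Hypothetical illustration: an exact sorted window is used here as a
    stand-in for a t-digest, which approximates quantiles without
    storing every value.
    """

    def __init__(self, quantile=0.99, window=1000, warmup=50):
        self.quantile = quantile
        self.window = window
        self.warmup = warmup
        self.recent = deque()  # values in arrival order
        self.ordered = []      # the same values, kept sorted

    def threshold(self):
        # Exact empirical quantile of the values currently in the window.
        n = len(self.ordered)
        return self.ordered[min(n - 1, int(self.quantile * n))]

    def update(self, value):
        """Return True if value exceeds the threshold learned so far."""
        is_anomaly = len(self.ordered) >= self.warmup and value > self.threshold()
        self.recent.append(value)
        insort(self.ordered, value)
        if len(self.recent) > self.window:
            # Forget the oldest value so the threshold tracks recent data.
            oldest = self.recent.popleft()
            del self.ordered[bisect_left(self.ordered, oldest)]
        return is_anomaly


# Made-up readings: a steady cycle between 10 and 16, then one outlier.
detector = QuantileThreshold(quantile=0.99, window=500, warmup=50)
readings = [10 + (i % 7) for i in range(200)] + [100]
flags = [detector.update(x) for x in readings]
```

Because the threshold is recomputed from recent data, it moves with the signal instead of being fixed in advance; only the final reading of 100 is flagged in this toy run.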
A short introduction to anomaly detection that touches on basically the whole range and uses examples to demonstrate the points. It is not uselessly verbose, but I would still prefer to have a more structured approach and have some bullet-points at the end of each chapter. The author also provides GitHub repositories for those who are interested in the code that was used for the examples, but the text itself is programming language independent.
Pretty light on technical and mathematical detail, overall. For the first few chapters, the book just comes across like a long advertorial for t-digest (which, I hasten to add, isn't explained in any real detail - it just gets introduced at the beginning of Chapter 3 as a "there's this algorithm, use it" type thing). The first chapters also give an overly lengthy introduction to some basic concepts in time-series analysis.
The fourth chapter is a lot more interesting, as it discusses using a streaming clustering algorithm to decompose a complex signal - but again, it's missing detail, which is a genuine shame. This lack of detail leaves me with more questions than answers - e.g. how is a signal reconstructed using the output of the streaming K-Means? What representation is used for the windowed signal data to input into the K-Means algorithm?
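One common way to answer the two questions above is sketched here as an assumption, not as the book's method. The signal is cut into fixed-size windows, each window is treated as a vector, plain batch k-means (swapped in for the streaming variant) learns a small set of typical window shapes, and each window is "reconstructed" as its nearest learned shape. A large reconstruction error marks a window as anomalous. All function names and the sine-wave data are hypothetical.

```python
import math


def windows(signal, size):
    """Split a signal into consecutive, non-overlapping windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def distance(a, b):
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(window, centroids):
    """The learned shape closest to this window (its reconstruction)."""
    return min(centroids, key=lambda c: distance(window, c))


def kmeans(points, k, iterations=10):
    """Plain batch k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            groups[best].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if no point was assigned
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def reconstruction_errors(signal, centroids, size):
    """Distance between each window and its closest learned shape."""
    return [distance(w, nearest(w, centroids)) for w in windows(signal, size)]


# Learn typical shapes from a clean, made-up periodic signal.
normal = [math.sin(2 * math.pi * i / 16) for i in range(160)]
centroids = kmeans(windows(normal, 8), k=2)

# Score a copy of the signal with a spike injected into the third window.
test_signal = list(normal[:48])
test_signal[20] += 3.0
errors = reconstruction_errors(test_signal, centroids, 8)
suspect = errors.index(max(errors))  # index of the worst-fitting window
```

Non-overlapping windows keep the sketch short; overlapping windows with a taper are another option, at the cost of a more involved reconstruction step.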
Overall, this book would have been better if it was about half the length and written as a blog post... or if it were two or three times longer and included implementation details. To finish on a positive note, it did give a nice useful introduction to some ideas that I plan to take and apply to other real-world scenarios - but it was just that - an introduction. I must add - the author has included a set of nice examples on his GitHub account which contain example code and data to go along with the explanations. This helps a lot to solidify several of the concepts introduced in the later chapters, and makes the lack of detail in the book a little bit less troublesome.
This book has some good ideas about anomaly detection, such as using t-digest for setting thresholds and using deep learning and autoencoders for detecting anomalies. It also explains the concept of seasonality and the usage of reconstruction and diffing with past patterns quite well.
However, it leaves one unsated on the technical front. There are links to further studies but many of the concepts are not completely fleshed out. A lot of sections seem quite repetitive as well.
For those familiar with statistics, there won't be any surprises here: some concepts and names will be introduced but it will feel like a natural step forward through the same path. For those without statistics, the book can still be useful as everything should be easy to follow, but learning about statistics will likely help moving forward.
The title "A new look at anomaly detection" signals that the book will provide some new and interesting insights in regard to anomaly detection. It did not. There is far better material on the subject out there. If you're a somewhat technical person, this is likely not a good book for you.
Good overview of anomaly detection techniques (could have been even shorter). Gives me a lot to think about using some of these techniques at work. We already use the author’s t-digest algorithm (a custom implementation in C) for a different use case, woot!
Only rough explanations were given with important details missing. No references to other literature! The link to the last example source code contains an empty repository.
Excellent high-level summary. Not much math and certainly no in-depth exploration, but a great little introduction to some of the basics of machine learning for someone new to the field.
An overall decent preview of new techniques for tuning systems to better recognize anomalies among all the data points being generated in the modern era.
The downside of ebooks is that it is not immediately clear that a book has only 66 pages of content, where you would expect a book with a similar title to occupy you for more than an afternoon's read. The book has some nice overviews of anomaly detection, but obviously doesn't really dive into the matter. The book also seems a bit constructed around this 't-Digest', which was thought up by the author, Ted Dunning. To be fair, in my limited capacity the t-Digest seems like a very good way to estimate medians on distributed / streaming data. However, it would have been more 'fair' to name the book 't-Digest: Using machine learning to estimate streaming data statistics' (or something alike), which wouldn't have gotten my hopes up of actually learning more (than 1 thing) about anomaly detection.
This book is too short and too terse to be of much value to probably anyone. The chapter on t-digest doesn't actually tell you what t-digest does or how it works. The following chapters cover simple use-cases that don't extrapolate well to actual, real life machine learning problems. None of them, for example, would be able to detect the freaky-ass fish swimming the wrong way on the cover.