Massive modern datasets make traditional data structures and algorithms grind to a halt. This fun and practical guide introduces cutting-edge techniques that can reliably handle even the largest distributed datasets.

In Algorithms and Data Structures for Massive Datasets you will learn:
- Probabilistic sketching data structures for practical problems
- Choosing the right database engine for your application
- Evaluating and designing efficient on-disk data structures and algorithms
- Understanding the algorithmic trade-offs involved in massive-scale systems
- Deriving basic statistics from streaming data
- Correctly sampling streaming data
- Computing percentiles with limited space resources

Algorithms and Data Structures for Massive Datasets reveals a toolbox of new methods that are perfect for handling modern big data applications. You'll explore the novel data structures and algorithms that underpin Google, Facebook, and other enterprise applications that work with truly massive amounts of data. These effective techniques can be applied to any discipline, from finance to text analysis. Graphics, illustrations, and hands-on industry examples make complex ideas practical to implement in your projects—and there are no mathematical proofs to puzzle over. Work through this one-of-a-kind guide, and you'll find the sweet spot of saving space without sacrificing your data's accuracy.

About the technology
Standard algorithms and data structures may become slow—or fail altogether—when applied to large distributed datasets. Choosing algorithms designed for big data saves time, increases accuracy, and reduces processing cost. This unique book distills cutting-edge research papers into practical techniques for sketching, streaming, and organizing massive datasets on disk and in the cloud.

About the book
Algorithms and Data Structures for Massive Datasets introduces processing and analytics techniques for large distributed data. Packed with industry stories and entertaining illustrations, this friendly guide makes even complex concepts easy to understand. You'll explore real-world examples as you learn to map powerful algorithms like Bloom filters, Count-min sketch, HyperLogLog, and LSM-trees to your own use cases.

What's inside
- Probabilistic sketching data structures
- Choosing the right database engine
- Designing efficient on-disk data structures and algorithms
- Algorithmic tradeoffs in massive-scale systems
- Computing percentiles with limited space resources

About the reader
Examples in Python, R, and pseudocode.

About the authors
Dzejla Medjedovic earned her PhD in the Applied Algorithms Lab at Stony Brook University, New York. Emin Tahirovic earned his PhD in biostatistics from the University of Pennsylvania. Illustrator Ines Dedovic earned her PhD at the Institute for Imaging and Computer Vision at RWTH Aachen University, Germany.

Table of Contents
1 Introduction
PART 1 HASH-BASED SKETCHES
2 Review of hash tables and modern hashing
3 Approximate Bloom and quotient filters
4 Frequency estimation and count-min sketch
5 Cardinality estimation and HyperLogLog
PART 2 REAL-TIME ANALYTICS
6 Streaming: Bringing everything
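To give a flavor of the kind of sketching data structure the book is about, here is a minimal Bloom filter in Python. It is an illustrative sketch only (not code from the book), deriving its k hash functions by salting a single SHA-256 hash; it answers membership queries with possible false positives but no false negatives, in a fixed amount of space.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted SHA-256 hashes over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("bob@example.com"))    # almost certainly False
```

Real deployments size m and k for a target false-positive rate, which is exactly the kind of space-versus-accuracy trade-off the book emphasizes.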
Primarily because:
- it IS a good book
- it is a much-needed book for a rare (but fascinating) topic
Pros:
1. A lot of effort was put into making things clear, and it shows: there are pictures, examples, and even some formulas and math models. Respect for that.
2. The first part (hash-based data structures, e.g. HLL, Bloom filters) is stellar.
3. The third part covers some very rarely discussed subjects, like dealing with external memory. Fascinating stuff.
Cons:
1. There are actually very few algorithms and structures covered, and even with all the effort put in, the book required some additional "homework" (reaching out to other sources).
2. I didn't like the second part, maybe because I already had fairly broad theoretical knowledge in that area.
3. Very few people will ever have the opportunity to put this knowledge to actual use :) If you're aware of that, you may remove this point from the list of cons :)
All in all, I had my fun. It was NOT an easy read, but a definitely satisfying one. In fact, I'd buy a volume 2 (if it existed).
In its initial chapter, this book is described as a gentle introduction. I think it's too deep for an introduction and too shallow for a book that truly covers these topics. While reading I kept feeling that it might be more valuable to just read the whitepapers or wiki pages describing the topics discussed. If you actually need these data structures and algorithms, you must have read tens if not hundreds of those already, and you don't care that much whether there's a nice description of something or not.
Part 1, about hash-based sketches, introduces Bloom filters, HyperLogLog, and Count-min sketch. It's nicely written and introduces the reader to the world of data structures that are not always right.
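To make "not always right" concrete, here is a toy Count-min sketch in Python; this is an illustrative sketch rather than the book's code. Hash collisions can only inflate a counter, never deflate it, so the estimate is always an upper bound on the true count.

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: depth rows of width counters, one salted hash per row."""

    def __init__(self, width=100, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum across rows is the least collision-inflated counter.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["apple", "apple", "banana", "apple"]:
    cms.add(word)
print(cms.estimate("apple"))   # at least 3; count-min never underestimates
print(cms.estimate("banana"))  # at least 1
```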
Part 2, Real-time analytics, discusses applications based on streams, like digests, windows, and more. I think this is functionality that is highly likely to already be implemented by a third-party engine.
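Even if a third-party engine ends up doing this for you, the underlying primitives are small enough to sketch. Below is reservoir sampling (Algorithm R), a classic way to keep a uniform random sample of a stream of unknown length in constant memory; this is a quick illustration, not code from the book.

```python
import random

def reservoir_sample(stream, k):
    """Return k items chosen uniformly at random from an arbitrarily long stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))  # e.g. five uniformly chosen values
```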
Part 3, Data structures for databases and external memory algorithms, dives into the database world. Again, it's too shallow to be directly applicable (if you write a database) and provides only some ideas for newcomers.
Quite a dense read. This book is split into three parts that focus on different aspects of solving problems with large datasets. The first part deals with structures like Bloom filters, count-min sketches, etc. If you've never used or seen any of these, the book does a good job of actually illustrating the concepts. The second part deals with sampling from data; this bit involves quite a bit of math and was difficult for me to follow. That may have been exacerbated by the fact that I was reading an EPUB, but I still think a lecture or video explanation would be easier to follow. The third part deals with storing and sorting data on disk: what are B-trees and LSM-trees, and how do you do a merge sort on a large file?
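For the curious, the answer to that last question is roughly: sort chunks that fit in memory, spill each sorted run to disk, then do a k-way merge of the runs. The Python below is a simplified illustrative sketch of external merge sort (not the book's code), and the file names at the bottom are hypothetical.

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, lines_per_chunk=100_000):
    # Pass 1: read chunks that fit in memory, sort each one, spill it to a temp file.
    runs = []
    with open(input_path) as f:
        while True:
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            # Make sure every line ends with a newline so runs merge cleanly.
            chunk = [line if line.endswith("\n") else line + "\n" for line in chunk]
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Pass 2: k-way merge of the sorted runs (heapq.merge streams them lazily).
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()

# Hypothetical file names, for illustration only:
# external_sort("big_input.txt", "big_input_sorted.txt")
```

A real implementation would also cap how many runs it merges at once to respect memory limits; the point of the sketch is just the two-pass structure.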
Overall, this is a pretty good and slightly unconventional read compared to what I have encountered in the software development book world. The illustrations are helpful and really well done, the code snippets are good to go through, and it forces you to think about data in a slightly different way from what you might be used to. The 4/5 rating is because it is a bit long and difficult to read at times: going through all of the math analysis of runtime complexity as it is explained in paragraph form can be quite tedious. Other than that, I highly recommend picking this up and trying to go through a chapter each day; it is going to be worth your time.
This is a different kind of book: even if you are well prepared for algorithms and data structures, you are probably used to complexity proofs, exactness, and precise answers. However, it's not that simple in the world of "big data" (or massive datasets and streaming, as the title suggests). If you want faster answers, you seek help in probabilities and stochasticity.
The choice of data structures and algorithms is extensive (frequency, cardinality, streaming, etc.), and each chapter has a similar structure, with great illustrations for the examples and a practical use case at the end. Overall, I enjoyed the book, illustrations, and examples, even though it was not an easy read. However, I wish it had more examples, and not only in Python.
A great book if you want to know more about probabilistic data structures, or already know a bit but want to deepen your knowledge. Main topics covered:
- probabilistic data structures: Bloom filters, Count-min sketch, HyperLogLog (see the toy sketch below)
- sampling of data
- data structures and algorithms optimized for external memory: B-tree, LSM-tree, optimizing merge sort and quicksort
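If you are wondering what "probabilistic" means concretely for cardinality, below is a toy HyperLogLog-style estimator in Python. It is an illustrative sketch only; a real HyperLogLog, as the book covers it, adds bias corrections for small and large cardinalities.

```python
import hashlib

def hll_estimate(items, b=10):
    """Rough distinct-count estimate using 2**b registers and leading-zero ranks."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)      # first b bits choose a register
        rest = h >> b          # remaining bits provide the "rank" observation
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # standard constant for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(hll_estimate(range(100_000))))   # roughly 100,000, typically within a few percent
```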