Making Sense of Stream Processing

Name: Making Sense of Stream Processing
Rating: 4.25 (30 reviews)
ISBN: 9781491940105

Martin Kleppmann

Genres: Technology, Programming, Technical, Computer Science, Software, Architecture, Computers

183 pages, ebook

Published March 4, 2016

13 people are currently reading

749 people want to read

About the author

Martin Kleppmann

6 books · 790 followers

Community Reviews

5 stars: 80 (42%)
4 stars: 80 (42%)
3 stars: 23 (12%)
2 stars: 5 (2%)
1 star: 0 (0%)

Displaying 1 - 30 of 30 reviews

Vlad Ardelean

157 reviews · 34 followers

August 15, 2019

It's not that bad, but it really is not good.

In the last year I read Martin Kleppmann's book "Designing Data-Intensive Applications" (I'll refer to it as DDIA), which was a 5/5. I'll constantly be comparing this book with DDIA.

The structure of my review: I'll experiment with a section title+text format

*This book is far worse than DDIA*
This one, however, is definitely worse. Many people might think that since DDIA is such a long book, they could pick up this one instead and still get some value out of it.
Well not so fast! If you're extremely new to the field of distributed systems, yes, you can pick this up safely as your first introduction. It will probably blow your mind.
If however you have worked in the field and read some other stuff, you'll see this book as full of holes and mistakes, incomplete, unstructured, and worst of all, I have not found a trace of talk about stream processing here! There are talks about caches, materialized views, Linux pipes, A LOT about Kafka, a lot of convincing that Kafka is better than other technologies... and literally not even a page about stream processing.
I personally made extensive use of highlighting in the DDIA book. For this book I mostly marked the mistakes I found, to remember them. I won't be so arrogant as to consider myself smarter than the author. Still, after reading DDIA I got a good understanding of distributed systems and as such I found many clear points where "stuff just doesn't make sense" in this book. Perhaps more explanations were needed, but until I investigate these myself, I'll consider that the book has errors.

*Misleading title compared to the content*
The title is academic. You'd think that the author would explain "stream processing", give some examples, compare a few types of systems, list advantages and disadvantages. That's at least what I thought. No, he does not do that, at least not to the extent that I expected. What he does quite a lot is compare stream processing with more classic technologies, such as relational databases and Linux pipes. He also explains (quite repetitively) just one type of architecture: message-queue-based stream processing. Some numbers to illustrate how repetitive this book is: the author takes a whole 20 pages to talk about Linux pipes, and then a few extra pages to compare them to Kafka as a message broker.

*Repetitive mistakes: log compaction when dealing with data corrupting bugs*
The author explains (over and over) that it's basically a good thing to think of every data modification as a message in an append-only log. So far so good, I think we can all agree that if we save all the modifications in a log, we'll know how our data looked at any moment in time... well, if you're not dealing with a distributed system anyway, where you don't have the luxury of having a single log.
BUT THEN the author talks about this cool feature from Kafka which, well... deletes all but the last message for each key in the queue for you (I'm simplifying it a little, but please focus on the deletion).
If you delete your log data, then it seems to me that these problems arise:
1. You won't be able to reconstruct data at arbitrary points in time (maybe problematic for some people)
2. You basically turned a message queue into a half-message-queue, half-key-value-store hybrid (not problematic but true). If you did that however, you might as well just have used a database, with a single table, which has a primary key, the "content" and an "updated_at" field... which is comparable to just having a regular database in the first place.
3. You are NOT protected against data-corruption problems, because you will have overwritten your correct data with corrupted data

The thing is that the author does not even mention any of these points. You're left on your own, he just carries on and even repeats his ideas that: you absolutely need time-ordering, you are protected against data-corruption, and log compaction saves space... No, something is not right here, I'm sorry. I'll assume this is an error.
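To make the objection above concrete, here is a minimal sketch of key-based compaction. It is a toy model and not Kafka's implementation: the `compact` function, the `(offset, key, value)` entry layout, and the sample data are all assumptions made for illustration. It only shows the property the reviewer is worried about, namely that once older entries for a key are dropped, a corrupted latest value is all that remains.

```python
from typing import Optional

# Each log entry is (offset, key, value). This layout is hypothetical.
Entry = tuple[int, str, Optional[str]]


def compact(log: list[Entry]) -> list[Entry]:
    """Keep only the most recent entry for each key, in offset order."""
    latest: dict[str, Entry] = {}
    for entry in log:
        latest[entry[1]] = entry  # a later entry replaces the earlier one
    return sorted(latest.values(), key=lambda e: e[0])


log: list[Entry] = [
    (0, "user:1", "alice@example.com"),
    (1, "user:2", "bob@example.com"),
    (2, "user:1", "CORRUPTED"),  # a buggy writer overwrites a good value
]

compacted = compact(log)

# The full log still contains the good value for user:1 ...
history_before = [value for _, key, value in log if key == "user:1"]
# ... but the compacted log only has the corrupted one.
history_after = [value for _, key, value in compacted if key == "user:1"]
```

In this sketch, recovering the earlier value for `user:1` is only possible from the uncompacted log, which is the trade-off between points 1 and 3 above and the space savings.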

*RabbitMQ/JMS are not good because they do not have reliable message ordering*
I don't know about JMS, and please correct me if I'm wrong, but let me cite the RabbitMQ documentation:
"Section 4.7 of the AMQP 0-9-1 core specification explains the conditions under which ordering is guaranteed: messages published in one channel, passing through one exchange and one queue and one outgoing channel will be received in the same order that they were sent. RabbitMQ offers stronger guarantees since release 2.7.0." ( https://www.rabbitmq.com/semantics.ht... )
Ok, so not perfect, but the author also does some hand-waving on this aspect of ordering, saying at some point that you should process messages from the queue in a single thread... well then, RabbitMQ and I'm sure many other message brokers will guarantee message ordering, because it's trivial to do that if you process data in a single thread.
...but then...wait a second! If you were able to process data in a single thread, why would you even need to care about stream processors and distributed systems? All would be fine if you could just run your entire stack on one server, and set your database's transaction isolation level to Serializable, right? Not sure what the author was thinking about when mentioning single-threaded processing in distributed systems.

*Very little talk about disadvantages or failure modes*
I can't really say too much about this topic, because there isn't much in the book.
There's one case when the author talks about how Twitter solves the problem of not creating duplicate usernames. Essentially:
1. put a username request in a queue
2. have a single-threaded process pull the request and try to process it
3. that process would advance the message through the system, basically putting the request into an "ok" queue, or a "failed" queue.
BUT THEN
4. the author says that if it takes too much time, you just return an "OK" instantly, and then deal with cleaning up later.

Well doesn't this mean the problem is not solved unless in the most trivial case where your system is small enough?

One good thing I learned however, is that you can use things such as the username for routing messages to different Kafka topics (or partitions?). If you have a single-threaded, single consumer for each of these topics, then you solved the username problem... well, only if the underlying database is not distributed, of course :P I was constantly under the impression that the author just keeps dumbing things down so they fit inside this ~180 page book. It's fair that you can't explain all the things in detail in 180 pages, but I'm asking myself why one would even attempt/pretend to?
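The routing idea in the paragraph above can be sketched roughly as follows. Everything here is an assumption for illustration: the partition count, the CRC32 hash, the case-insensitive comparison, and the `PartitionConsumer` class are hypothetical and not taken from the book or from Kafka's client API. The point is only that all requests for the same username land on the same single-threaded consumer, which can then decide uniqueness without coordination.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(username: str) -> int:
    """Map a username to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(username.lower().encode("utf-8")) % NUM_PARTITIONS


class PartitionConsumer:
    """Single-threaded consumer that owns the usernames of one partition."""

    def __init__(self) -> None:
        self.taken: set[str] = set()

    def handle(self, username: str) -> bool:
        """Return True if the claim succeeds, False if the name is already taken."""
        name = username.lower()  # assumption: usernames are case-insensitive
        if name in self.taken:
            return False
        self.taken.add(name)
        return True


consumers = [PartitionConsumer() for _ in range(NUM_PARTITIONS)]


def claim(username: str) -> bool:
    """Route the request to the one consumer responsible for this username."""
    return consumers[partition_for(username)].handle(username)


results = [claim("alice"), claim("bob"), claim("Alice")]
```

As the reviewer notes, this only settles uniqueness within the consumer's own state; a distributed store behind it would need its own guarantees.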

*"A few hundred milliseconds with Samza"*
Yup, that's all the talk there is about Samza. Everything takes a few hundred milliseconds, so all your problems are solved. If the author spent 10+ percent of the book explaining trivial concepts such as Linux pipes, and spends so much time repeating the easier topics about Kafka, then why isn't there at least 1 page in this book dedicated to stream processors?
In DDIA I learnt about these scenarios of stream processing:
1. Stream-DB join
2. Stream-Stream join
3. Timing windows
...absolutely no mention of these topics in this book. Well it's good to know at least that you can sum all those up in the sentence "it takes a few hundred milliseconds". I am a little disappointed that in a book about stream processing, there's no talk about stream processing, only about messages and queues.
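For readers who have not met the scenarios listed above, here is a rough sketch of two of them: a stream-table (stream-DB) join and a tumbling time window. The function names, the event shapes, and the 60-second window are assumptions made for this example; it is not how Samza or any other stream processor implements them.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size


def stream_table_join(events, table):
    """Enrich each (user_id, action) event with that user's row from a lookup table."""
    for user_id, action in events:
        yield {"user_id": user_id, "action": action, "profile": table.get(user_id)}


def tumbling_window_counts(events, window=WINDOW_SECONDS):
    """Count (timestamp, key) events per fixed, non-overlapping time window."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window) * window
        counts[(window_start, key)] += 1
    return dict(counts)


users = {1: {"name": "alice"}}
joined = list(stream_table_join([(1, "login"), (2, "login")], users))

timed_events = [(5, "page_view"), (42, "page_view"), (61, "page_view"), (75, "click")]
counts = tumbling_window_counts(timed_events)
```

A stream-stream join would additionally need to buffer events from both inputs for some time window, which this sketch leaves out.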

*Not much about CAP*
Not much at all. The author does quote an article from Nathan Marz: http://nathanmarz.com/blog/how-to-bea... ..where Nathan Marz tries to shove under the carpet the fact that his architecture produces a perpetually-inconsistent/eventually-almost-consistent system. I am not a fan of Nathan Marz's writing. I consider him an evangelist/salesperson, not a serious author.

*This is a sales brochure, not a book about understanding stream processing*
confluent.io sponsored the author to write Bottled Water, a currently unmaintained library which was supposed to shove change events (inserts/updates/deletes) from PostgreSQL into Kafka. The author mentions Confluent's projects a lot in this book. Am I alone in thinking this is not appropriate in a book with such a clickbait-ish title? I wanted to understand stream processing, and not be continuously reminded about Confluent and their projects. To be honest, I see they have this "Confluent Platform" which probably does stuff, but still I will not be surprised to discover that in this book the author simply described parts of Confluent's solutions.

Conclusion:
This is a sales brochure, treating topics very superficially, and the title is clickbait. It goes into some subjects theoretically while in other parts it just presents superficial details about how you'd implement the single architecture style presented in the book. And materialized views...what do they have to do with anything? The whole caching chapter - it's ok, but I don't see it connected to the rest of the book.

ex-prio tech

Luke

1,069 reviews · 20 followers

May 12, 2017

A succinct argument for turning the web application database inside out to simplify modern convoluted development's many moving layers of indices, caches, and derived computed views of data throughout the stack. The author's solution and focus is the ordered immutable log, supporting simple normalized writes and loosely coupled re-playable pipelines populating all forms of derived data for reads.

Basically this book is the super short version of the path many of us have taken the hard way in webdev over the last 10 years. You can ignore the product-specific mentions in this (Kafka, Postgres, examples from Google/LinkedIn/Twitter) before and after reading with no loss of insight.

tech

Ahmad A.

78 reviews · 15 followers

January 27, 2018

One of the best books that one can read on the topic of Event/Stream Processing. A very good account of simplifying complex solutions to issues related to modern distributed systems without betraying the basic guarantees of data consistency and integrity. A highly recommended read.

comp-sci

Minh Nhật

92 reviews · 49 followers

October 15, 2021

Martin Kleppmann always surprises me with simple explanations of complex concepts.

Redowan Delowar

46 reviews · 4 followers

January 10, 2022

I was reading Jay Kreps' amazing piece on making immutable logs the cornerstone of your application's architecture. I picked up this book not to understand the nitty-gritty of Kafka-driven workflow but to get a high-level overview of what's so novel about append-only logs and what kind of problems they promise to solve. Also, I was curious about why LinkedIn built Kafka and how it's different from the traditional AMQP solutions that have been around much longer. On that premise, the book didn't disappoint at all.

The book clarified the core concepts of Event Sourcing and CQRS for me with a single example and that alone made it worth my time. Also, I found the third chapter that deals with Change Data Capture (CDC) quite pragmatic. With the CDC framework, you can turn your existing relational database like Postgres into an event-spewing system and then collect those events in Kafka. Afterward, any number of consumers can work on the messages stored in Kafka to construct a loosely coupled fleet of services doing many different things. This is much easier to pull off compared to rearchitecting your entire application to make it event-driven.
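A small sketch of the CDC pattern described above, under stated assumptions: the event shape (`op`, `key`, `row`), the `apply_change` and `replay` helpers, and the sample change log are all hypothetical, and a plain Python list stands in for a Kafka topic. It shows two independent consumers reading the same ordered change log and deriving different things from it.

```python
from collections import Counter


def apply_change(view: dict, event: dict) -> None:
    """Apply one change event to a derived key-value view."""
    if event["op"] in ("insert", "update"):
        view[event["key"]] = event["row"]
    elif event["op"] == "delete":
        view.pop(event["key"], None)


def replay(log: list[dict]) -> dict:
    """Rebuild a view from scratch by replaying the whole log in order."""
    view: dict = {}
    for event in log:
        apply_change(view, event)
    return view


# Hypothetical change events, as a database's change stream might emit them.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "alice", "city": "Oslo"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "city": "Lima"}},
    {"op": "update", "key": 1, "row": {"name": "alice", "city": "Rome"}},
    {"op": "delete", "key": 2, "row": None},
]

# Consumer 1: a cache that mirrors the current state of the table.
cache = replay(change_log)

# Consumer 2: an audit-style summary that reads the same log independently.
op_counts = Counter(event["op"] for event in change_log)
```

Because each consumer only depends on the log, a new one can be added later and catch up by replaying from the beginning, without touching the application that writes to the database.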

Overall, I'd say this is worth your time if you're either new to the event-driven ecosystem or just want to learn more about the philosophy behind why a simple log structure can solve so many of the common pain points in distributed systems. However, you can comfortably skip it if you've already built multiple production-grade systems powered by Kafka-like tools or are only interested in the hardcore technical details.

engineering

Peter Rybarczyk

95 reviews · 9 followers

February 16, 2020

It's my first book from Martin Kleppmann and the first book about Event Streams so my review can be inadequate or a bit hyped, but...

I can imagine a better beginner book about stream processing, but I didn't find any on the market. I think the best part of this book is the number of references to topics that every mid/senior engineer starts to think about: how databases work, Unix, software history, etc. It's worth reading just because of that, but this knowledge is also used to describe higher-level problems like data integrity between systems, and the log-stream philosophy and its use cases. OK, it covers them only cursorily, that's true, but for someone who is just starting to learn about this it's more than enough.

Conclusion
If you are looking for a book to understand streams and their philosophy, it should be enough to start. Read it first, then your future learning & life will be much easier. But if you have already implemented a few stream systems... I assume that you can easily skip this book.

Sameer Rahmani

7 reviews · 9 followers

October 27, 2017

Another great book from Martin Kleppmann.

If you have read about streams & stream processing before and you're familiar with logs and stuff like that, this book does not contain any new info for you. But since it's a short book, it can be a great review of your knowledge. One of the awesome things about this book and Martin Kleppmann in general is that he puts lots of references in each chapter, so the reader can expand his/her knowledge of the chapter by following the references.

Awesome book. Totally recommended.

distributed-software

Sudarshan

66 reviews · 14 followers

November 11, 2024

This was a great read which explained the philosophical underpinnings of the Stream processing movement. If you don’t have much experience with Apache Kafka, Azure EventHub etc then this is a pretty good book that you can finish in one sitting and get a good conceptual understanding of Event Streaming services.

This is a high level concept book and not an implementation book that goes deep into Kafka’s nuances. I can finally make sense of Apache Kafka now. Kudos to the author for explaining the concepts with such clarity.

Nuno Caro

2 reviews

January 7, 2020

Good introductory level book about what is making stream processing a big deal.

The author goes over a good number of different approaches to data management across history, explaining their pros and cons and finishing by explaining where stream processing fits amongst them.

The book may lack some detail, but given it is meant to be entry level and it can be acquired for free on some websites, I would say give it a look.

Yifan Yang

45 reviews · 7 followers

January 28, 2025

This book serves as an excellent introduction to event-driven thinking and system design philosophy. It explores the shift from the traditional "aggregation" perspective to an event-stream-based approach, highlighting how this philosophy addresses data consistency and concurrency challenges. Additionally, it provides a brief overview of Change Data Capture (CDC) and Kafka Connect, illustrating their roles in facilitating data movement.

architecture

Everest Chen

4 reviews

December 24, 2021

An easy-to-follow high-level introductory book of why you want to use stream processing to build applications.
It could be clearer about why implementing with streams is the better option in the last Twitter example. If a user only needs request-and-response instead of a dynamic user interface, is there any obvious benefit over the "traditional" approach to implementation?

Bogdan

3 reviews

January 1, 2018

I read the book and then used several of its concepts at work.
It's very useful for better understanding additional use cases for Kafka topics and their respective consumers/processors.
I particularly liked the analogy with Unix pipes and how producers/consumers can be loosely coupled.

Arun Ravindran

30 reviews · 2 followers

March 3, 2019

Short and clearly written guide to the software architecture transformation from traditional databases to event sourcing. Most chapters show the advantages of simple log-based queues as a superior solution to many problems. Apache Kafka features as the primary technology throughout the book.

Tomasz

3 reviews

April 26, 2020

Definitely a very informative book. Not as detailed as "Designing Data-Intensive Applications" by the same author, but it presents a few interesting concepts. It also contains many references at the end of each chapter if you want to take a deep dive on a selected topic.

Amr

48 reviews · 13 followers

May 5, 2020

A good quick summary and introduction to reason about systems using stream processing. I worked with similar systems that used this pattern and the systems were resilient and durable because of that pattern, along with other things we had in place.

Recommended for a quick read.

Tharun

3 reviews

November 15, 2021

Very basic overview of stream processing. A few interesting perspectives, like the similarities between replication and indexing. Since the concepts are mainly focused on the log, the book could have provided more detailed information on it and on its implementation.

Adil Khashtamov

24 reviews · 1 follower

August 31, 2018

Series of blog posts converted into a book. Interesting thoughts regarding stream processing and the Unix philosophy.

Jari Aarniala

13 reviews

October 29, 2018

A nice collection of essays from Martin Kleppmann. The CDC chapter felt a little out of place.

Tonmoy Chowdhury

3 reviews

February 5, 2019

Typical of Mr. Kleppmann, he has successfully broken down the notion of stream based architectures and made it very accessible.

before-march-2019 printed

Vara

1 review

December 27, 2019

Perfect approach to the road of stream processing.

Senjin Hajrulahovic

53 reviews

April 18, 2021

A light read. High-level explanations with lots of visualizations.

Mehdi Home

51 reviews · 12 followers

August 13, 2021

It has some good ideas, but I expected more details. I feel like it assumes everything's gonna be ok with the proposed solution, and I'm very skeptical of this kind of optimism.

computer

Sachin Govind

36 reviews · 3 followers

January 18, 2022

The book could be 3-4 blog posts instead of a whole book. Disappointed, considering the quality of his other books.

Justin

3 reviews

March 7, 2017

Very basic overview of the core concepts and benefits of using Kafka. Some nice thought provoking discussion comparing databases and append-only logs. Many "illustrations" in the book were simply text that added 0 value and seem to have been added to make the book lengthier.

software-dev

Andrew Saul

139 reviews · 9 followers

September 10, 2016

Simply excellent. It's a free download from O'Reilly. I would've happily paid for it.
I bought Designing Data-Intensive Applications halfway through reading this.
Stream processing platforms are a real change from the traditional database world. They offer such a powerful set of abilities at such a low overhead.
They aren't applicable to every use case but given how good they are I wonder if the use cases they don't work with become mostly irrelevant anyway?
If you have any interest in this area I cannot recommend this more highly. There's heaps of resources here to keep on going with after you finish the book.

Ferhat

3 reviews

October 20, 2016

Nice coverage of many data(base) and distributed systems related concepts with clear, fluent explanation and great drawings. Especially, I love the comparison of data streams (esp Kafka) with Unix pipes. You should read this mini-book if you want to review your understanding of many computer science topics.

Alex Ott

Author · 3 books · 207 followers

April 24, 2016

Between 3 & 4 - series of blog posts, converted into a small book.

big-data

Daniel

Author · 3 books · 37 followers

April 12, 2016

A good read, focussing of course on concepts of and solutions with Kafka. Not a lot of new content, though, if you have already read Martin's and Jay's blog posts.

Anirudh Mallem

12 reviews · 2 followers

June 15, 2016

Excellent starting material for people not familiar with streaming technologies.

Bill Metangmo

5 reviews · 1 follower

April 1, 2017

Excellent book about:
- Event sourcing: how it works and how to implement it
- Messaging systems: distributed-log ones such as Kafka and queue-based ones like AMQP or JMS
- Change Data Capture: streaming data from databases, turning command statements into immutable events
- Unix philosophy: how and why it works well, and what we can learn from its design to build stream processing engines
- Database inside out: learn more about database internals and how to use them to build an efficient streaming platform
- Serialization: Avro/Thrift/Protobuf
