Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack. You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions. You'll
I don't think that 'Pocket Reference' is the proper way to describe this book. An example? Sample path? One of the ways to do it? A representative case? A bit of theory, some SQL, a basic introduction to how to structure a processing pipeline: that's what you can get out of this book. It's probably OK if you want to figure out what actually powers (under the hood) modern data processing pipelines, but I wouldn't say it's useful if you want to set a solid foundation for more thorough research.
I read this book to get up to speed with modern software data engineering. I think I achieved the goal, although I finished with a knowledge of how much I do not know, rather than with confidence in building the solutions myself. James seems to take an opinionated approach by using cloud warehouse databases (Redshift and Snowflake). The use cases and computations are well suited to them, and I would need to read other resources to see how the patterns mentioned play with other technologies. The price/ops complexity of possible stacks is not mentioned. The chapters with SQL examples look great. I learned a bunch there. There are also enough mentions of various technologies and books throughout the book -- I learned about Kimball modeling, dbt, Airflow, Atlas... It would be great to extend the reasoning about production and operations, pitfalls and risks -- such as schema migration, scaling, schema registry, deployment, versioning, durability risks, retention, backups, recomputing... Validation, metrics collection, and Slack notifications are presented, and I would like to hear more about visualization. Overall it is a good book and I only wish every chapter of it were bigger. Oh wait, there is "Pocket" in the name. Never mind, then.
Great overview of the ETL process, with examples. Doesn’t touch streaming but outside of that I have no complaints. Gets to the point and doesn’t delve TOO deep into the details.
I skimmed this and found a good introduction to a number of issues and perspectives, though as a "pocket reference" it's quite brief. Beginners like me will find some helpful ideas and opinions: those with more background aren't likely to find much. It's good for what it is, but don't expect a lot more.
This is a good book for data engineers looking to work with CDC, ETL & EtLT data-moving systems. It covers a little orchestration management with Airflow too. Nice book for the desk of anyone working in the data industry.
Data Pipelines Pocket Reference by James Densmore is a handy guide for data engineers who already have some foundational knowledge and are looking for quick insights or inspiration. It’s not a beginner’s guide or a deep dive, but it does a good job of covering the essentials of designing and managing data pipelines in a concise format.
The examples included are practical, but they’re fairly basic, so if you’re looking for more in-depth explanations or complex use cases, this might not fully meet your needs. That said, it’s a great resource for brushing up on concepts or getting ideas for tackling specific pipeline-related challenges. Overall, a solid reference book, just as the title suggests.
The data engineering industry suffers from a lack of good books. This one is very practical and ELT-focused. It complements theoretical books like Kimball's and Designing Data-Intensive Applications well. It is still far from perfect, though. Some parts are already outdated. The focus on ELT, without extensive discussion of its tradeoffs, is highly questionable.
Good read. Not great. For me it was a bit too specific in its code (I know how to write code, and I don't use Airflow, which the book does); luckily I could skip those parts. It was a good quick read-through, covering key terms and principles. It could probably have been shorter and maybe more theoretical, but it was worth it for the parts that were good. Can recommend as an introduction to data pipelines. Not sure if it's a good reference piece really. I'd probably just use a search engine + relevant documentation.
A very good primer on data processing and pipelines with code to consider the key elements involved in data pipelines for validation and iteration.
I agree with other reviewers that it is not a manual but more a survey of the field. It’s certainly helpful for a data scientist to know about the concepts. Much more detail and depth is needed if you’re looking for a standalone data engineering book 📖
Not a good book as an introduction to Data Engineering but rather a reference to do specific Data Engineering tasks or workflows. Did learn some things but began skimming about halfway through.
Other complaints:
- Instructions aren't clear, and I had difficulty setting up the environment to even complete the exercises
- No conclusion highlighting the main takeaways from the book
May try Fundamentals of Data Engineering next to see if that is any better and more suited to what I am looking for.
Hands-on, practical content for beginners. Got some good basics and prototype-ready stuff out of it for production.
At the same time, it limits itself very much to these basics; there could have been a chapter on distributed computing, or on using async patterns to cover more volume and variants of DE in real life.
It's a quick intro and reference on how to create and maintain data pipelines. A lot of examples in here are pretty dated. Does a good job at introducing the basics. Look, it's an O'Reilly pocket reference; what else can you expect? It could probably have spoken more generally so that it doesn't just seem like a few articles on dbt, Airflow, Hadoop, but hey, it's good for what it is.
Very well-written small book. The only reason I've put 4/5 and not 5/5 was that... for some weird reason I hadn't realized that the book would be so short. So it was a small disappointment. But surely the content of the book was worthwhile, if only there was more...
I found this book practical, concise and to-the-point. It's just a starting point really, but I'd like to know more from the people here saying that it's already outdated. From my point of view, the book was worth it.
About a 6+/10: a few interesting subsections, but even for a book of this size the approach is extremely narrow, and about a third of it is code snippets that you can find in any tutorial, documentation, or Stack Overflow thread.
Easy to read with great code examples. Really liked the less verbose and more practical approach. Highly recommended for anyone looking to pursue data engineering.
This book provides a great introduction to the various steps associated with building data pipelines. If you’re new to data engineering or want to understand the high-level steps associated with building data pipelines, this book acts as a great reference point.
Nice introduction to many data pipeline technologies, and to the point... Good book to get started as a data engineer, given you are already a senior developer.
Detailed examples given, primarily focused on Apache Airflow and Python. Clear explanations of each step. Would have benefited from a few examples of use cases across different industries. Overall great practical introduction to data pipelines.
Nice, quick overview of ELT pipelines focusing on SQL and Airflow. It covers a pretty narrow slice of the data engineering world but was still a useful read.