Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copiesDelivers real-world solutions for the most time- and labor-intensive portion of data warehousing-data staging, or the extract, transform, load (ETL) processDelineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouseOffers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality
This is one of the best written books (in a literary sense) for IT, which is why I wanted to 4-star and review it on Goodreads. I thoroughly enjoyed reading this book for its clarity, good use of language and eloquence. And what more, learnt a great deal more about Data warehousing :) I recommend it to anyone who has even the slightest of inclination towards databases, data modelling and data analysis. This book is a great value addition. The book gives you an eagle's eye view of the kind of data different (major) industries deal with and hence the industry-specific challenges in datawarehousing them and tips and useful hacks on how to tackle these challenges so that you come back to the data in your own business with a "refreshed perspective".
It has many great explanations to develop an etl system to build up a dimensional model. But I think if it includes to build up an Etl project from scratch it would be really better.
Very technical book when entering the later chapters.
👀 How this book changed my daily live (Takeaways)
The ETL system makes or breaks the data warehouse. Although building the ETL system is a back room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse.
⁉ Spoiler Alerts (Highlights)
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement. What is a measurement? A common definition of a measurement is an amount determined by observation with an instrument or a scale.
In dimensional modeling, we deliberately build our databases around the numerical measurements of the enterprise. Fact tables contain measurements, and dimension tables contain the context surrounding measurements. This simple view of the world has proven again and again to be intuitive and understandable to the end users of our data warehouses. This is why we package and deliver data warehouse content through dimensional models.
"The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data" is a valuable guide for data professionals. The book provides detailed techniques for optimizing ETL processes, helping to effectively extract, cleanse, and transform data for use in warehouses.
If you are interested in practical implementation of such technologies in your business processes, you should pay attention to VentorTech. The company offers innovative solutions in the field of data management, including the integration of ETL tools with modern systems such as Odoo. Their approach allows you to automate complex processes and ensure the reliability of data processing.
Fundamentals and reference book for dimensional modelling. Learnt a lot about design patterns of dimension and fact tables as well as best practices. Presentation of summary available as pdf
"We expand the traditional ETL steps of extract, transform, and load into the more actionable steps of extract, clean, conform, and deliver, although we resist the temptation to change ETL into ECCD!", "Two simultaneous threads must be kept in mind when building an ETL system: the Planning & Design thread and the Data Flow thread.", "Planning & Design: Requirements/Realities ➔ Architecture ➔ Implementation ➔ Test/Release; Data Flow: Extract ➔ Clean ➔ Conform ➔ Deliver", "The process that involves transforming data from its original format to a dimensional data store accounts for at least 70 percent of the time, effort, and expense of most data warehouse projects." "The biggest risk to the timely completion of the ETL system comes from encountering unexpected data-quality problems. This risk can be mitigated with the data-profiling techniques discussed in Chapter 4." "For illustrative purposes, Oracle is chosen as a common dominator when specific SQL code is revealed", "Throughout the Toolkit series of books we have recommended a role-based approach to security where the ability to access the results from a data warehouse is controlled at the final applications delivery point. This means that security for end users is not controlled with grants and revokes to individual users at the physical table level but is controlled through roles defined and enforced on an LDAP-based network resource called a directory server. It is then incumbent on the end users' applications to sort out what the authenticated role of a requesting end user is and whether that role permits the end user to view the particular screen being requested. This view of security is spelled out in detail in Data Warehouse Lifecycle Toolkit.", "The back room and the front room of the data warehouse are physically, logically, and administratively separate. In other words, in most cases the back room and front room are on different machines, depend on different data structures, and are managed by different IT personnel. <...> No query services are provided in the back room. Read that sentence again! Our approach to data warehousing assumes that data access is prohibited in the back room, and therefore the front room is dedicated to just this one purpose." Для Microsoft BI это означает, что SSRS, SSAS и Sharepoint должны быть на отдельных серверах в White room. А SSIS должно быть в Black room. В эконом варианте данные из сторонних систем должны сначала загружаться в стейджинговые таблицы. А затем после Cleaning и Conforming загружаться в рабочие таблицы. "There are no indexes or aggregations to support querying in the staging area. There are no service-level agreements for data access or consistency in the staging area. All of these data access requirements are handled in the presentation area." "Keep in mind that the data-staging area is not a transaction environment and that you should not allow data to be manually entered into it. If manual tables must be maintained, an application should be developed outside of the data-staging area, and the resulting data should be provided to the ETL team and incorporated into the staging area via an ETL process." "transaction databases are designed to get data in; dimensional designs get data out. Staging-area de-signs must do both." "A data mart is a set of dimensional tables supporting a business process." "Data marts contain all atomic detail needed to support drilling down to the lowest level. The view that data marts consist only of aggregated data is one of the most fundamental mistakes a data warehouse designer can make. Aggregated data in the absence of the lowest-level atomic data presupposes the business question and makes drilling down impossible. We will see that a data mart should consist of a continuous pyramid of identically structured dimensional tables, always beginning with the atomic data as the foundation" "The mission of the data warehouse is to publish the organization's data assets to most effectively support decision making. The key word in this mission statement is publish. Just as the success of a conventional publica-tion like a magazine begins and ends with its readers, the success of a data warehouse begins and ends with its end users." "A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making." "It has been the conventional wisdom that sorting, merging, deleting, replacing, and many other data-migration functions are much faster when they are performed outside the DBMS. Many utility programs are dedicated to text-file manipulation." Questions: 1. Unit testing in T-SQL, SSIS? Отладку кода T-SQL делать намного проще, чем отладку SSIS. 2. SSIS выполняются медленнее кода на T-SQL. Какие можно использовать альтернативы SSIS при использовании Microsoft BI? Или надо просто использовать выделенный сервер для SSIS? 3. Можно ли использовать SSIS не для batch processing, а для real time (streaming) processing?
This is an excellent book for anyone working in data warehousing and ETL. It solidified a lot of the knowledge that I've gradually picked up over the last ten years and further improved my grasp of theory and practice. There are some superb insights here, as well as plenty of detailed "how to" information. Not only that, but the book helps to cement how a proper, fully-fledged data warehouse project should be carried out and places the ETL function right at the heart of that endeavour.
Kimball's writing style is surprisingly approachable and readable, given the often dry subject matter. It's also impressive that, nearly 10 years on from the first edition, this book remains the definitive word on ETL.