A rich, narrative explanation of the mathematics that has brought us machine learning and the ongoing explosion of artificial intelligence
Machine learning systems are making life-altering decisions for approving mortgage loans, determining whether a tumor is cancerous, or deciding whether someone gets bail. They now influence developments and discoveries in chemistry, biology, and physics—the study of genomes, extra-solar planets, even the intricacies of quantum systems. And all this before large language models such as ChatGPT came on the scene.
We are living through a revolution in machine learning-powered AI that shows no signs of slowing down. This technology is based on relatively simple mathematical ideas, some of which go back centuries, including linear algebra and calculus, the stuff of seventeenth and eighteenth-century mathematics. It took the birth and advancement of computer science and the kindling of 1990s computer chips designed for video games to ignite the explosion of AI that we see today. In this enlightening book, Anil Ananthaswamy explains the fundamental math behind machine learning, while suggesting intriguing links between artificial and natural intelligence. Might the same math underpin them both?
As Ananthaswamy resonantly concludes, to make safe and effective use of artificial intelligence, we need to understand its profound capabilities and limitations, the clues to which lie in the math that makes machine learning possible.
*This audiobook contains a PDF of equations, graphs, and illustrations.
PLEASE NOTE: When you purchase this title, the accompanying PDF will be available in your Audible Library along with the audio.
ANIL ANANTHASWAMY is former deputy news editor and current consultant for New Scientist. He is a guest editor at UC Santa Cruz’s renowned science-writing program and teaches an annual science journalism workshop at the National Centre for Biological Sciences in Bangalore, India. He is a freelance feature editor for the Proceedings of the National Academy of Sciences’ “Front Matter” and has written for National Geographic News, Discover, and Matter. He has been a columnist for PBS NOVA’s The Nature of Reality blog. He won the UK Institute of Physics’ Physics Journalism award and the Association of British Science Writers’ award for Best Investigative Journalism. His first book, The Edge of Physics, was voted book of the year in 2010 by Physics World. He lives in Bangalore, India, and Berkeley, California.
Anil Ananthaswamy does an excellent job of explaining the mathematics and intuition behind several popular machine learning and AI algorithms, including Support Vector Machines, Neural Networks, and Principal Component Analysis. He masterfully weaves together the history of these technologies while presenting the math and algorithms in an accessible way. The narrative highlights the collaborative nature of the field, showcasing how breakthroughs often build upon the work of others, propelling the field forward.
Given the breadth and depth of AI, it’s understandably challenging to cover every concept in detail. The book begins with foundational concepts and progresses to more advanced topics like Large Language Models (LLMs) and Deep Learning. While traditional machine learning algorithms are covered thoroughly, the discussion on more recent developments, such as LLMs, feels less comprehensive. Nevertheless, the author makes effective use of equations and visualizations, layering concepts in a logical sequence that offers a "graded ascent" (pun intended).
This book is particularly useful for readers seeking to revisit specific algorithms or practitioners looking to deepen their understanding and find references for further exploration. However, a minimal mathematical background and a willingness to engage with technical details are recommended, as some sections delve deeply into the math.
Overall, it’s a solid introduction to the field of machine learning, capturing both its elegance and complexity. The book’s approach does justice to its subtitle and will likely inspire further exploration of the subject.
The author helpfully explores the historical context of neural networks, but then quickly flies through the last decade of deep learning (i.e., Modern AI) in a few chapters right at the end.
I found it didn't provide much more than superficial insight into "Modern AI" (e.g., word2vec gets nothing more than a passing reference).
Why Machines Learn: The Elegant Math Behind Modern AI (2024) by Anil Ananthaswamy delves into the history of machine learning, combining it with the math that explains how machines learn. Ananthaswamy is a science writer who has written for New Scientist. He went to one of the Indian Institutes of Technology (IIT) and trained and worked in electronics and computing. He has written prize-winning science books.
Stephen Hawking was reportedly told that ‘every equation cuts the readership in half’. This book ignores that advice and describes the history of machine learning (ML) with many equations.
The book starts with vectors and matrices and how they work. The focus then moves to Frank Rosenblatt and his invention of the perceptron. The perceptron is an algorithm that performs supervised learning and can do binary classification. Marvin Minsky criticized the perceptron, pointing out that a single-layer perceptron cannot compute XOR and some other functions.
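To make the learning rule concrete, here is a minimal perceptron sketch in Python (my own illustration, not the book's code): the weights are nudged whenever a point is misclassified, the loop converges on a linearly separable function like AND, and it never settles on XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Rosenblatt-style perceptron: find w, b with sign(w.x + b) = y (labels +/-1)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:        # misclassified point
                w, b = w + yi * xi, b + yi    # nudge the hyperplane toward it
                errors += 1
        if errors == 0:                       # a full pass with no mistakes: converged
            return w, b, True
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([-1, -1, -1, 1])))  # AND: converges
print(train_perceptron(X, np.array([-1, 1, 1, -1])))   # XOR: never converges
```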
In the next chapter, gradient descent is described, and Bernard Widrow’s work on adaptive filters is outlined. Widrow then applied similar ideas to a model of a neuron to enable it to learn. Ananthaswamy then describes Bayesian statistics and how probabilistic classification models can be built.
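As a toy illustration of gradient descent itself (my own sketch, not an example from the book): step repeatedly against the derivative until the function stops decreasing.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a one-variable function by stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)   # each step moves a little downhill
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2*(x - 3), minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```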
Principal Component Analysis (PCA) also gets a chapter, describing how PCA can be used to help with ML. Hopfield networks are then described. Ananthaswamy then writes about George Cybenko’s proof that neural networks with a single hidden layer can approximate any function.
The backpropagation algorithm and the work of Rumelhart, Hinton, and Williams then get a chapter. This shows how modern neural networks learn. The application of neural networks to machine vision is then described. Finally, the new work on LLMs is covered quickly.
Why Machines Learn has a lot of math in it. It’s not hard math; it’s math from the last few years of high school or the first years of university. But that’s a substantial hurdle for many people. It’s unclear whether people without a high level of math can benefit much from the book. But it is worthwhile to have a book that goes through the history of ML with math.
The big issue with the book is whether many people could really follow the math. This is difficult without a background in higher math or in ML. There are excellent courses, particularly Andrew Ng’s, which provide a good introduction to ML. Indeed, Ananthaswamy mentions some of these courses in the book.
I personally learnt more about the history of ML. I also learned about Hopfield Networks and George Cybenko’s proof about neural networks.
Why Machines Learn is a good book. Most people who know something about ML will learn something from the book. But it may be too hard for many.
Warning: Long book review ahead. The point of my long review below is to digest what I've learned, and hopefully to better remember what I've learned for future reference.
"Why Machines Learn" is a fascinating book focused on the math underlying modern AI. With minimal historical and conceptual coverage of AI/ML developments, the book focuses on illustrating - through equations and graphs - the math underlying machine learning. It took me back to high school and college days and it was an enjoyable challenge to follow the math as best as I could. I was able to mostly follow the logic of each individual equation through the first half of the book, whereas I chose to follow the logical train of thought more than the individual equation in the second half of the book.
"Why Machines Learn" draws on linear algebra, calculus, probability and statistics, and optimization theory. It starts out by discussing the perceptron, an algorithm for supervised learning (i.e., using annotated data) of single-layer neural networks that perform binary classifications. In the perceptron, ML learns the weight vector, given a set of input data vectors, to find a hyperplane that linearly separates the data into two groups. The XOR problem, however, stalled ML developments: instances of data points where one cannot draw a straight line to separate two groups (exemplified by four data points, two circles and two triangles, laid out in a square, equally distanced from each other). Minsky and Papert proved that a single layer of perceptrons could not solve such problems. This issue led to the first AI winter, from roughly 1974 to 1980.
Before covering the thawing of that AI winter, Ananthaswamy reviews additional pre-AI winter algorithms, including the least mean squares (LMS) algorithm, one of the most influential ML algorithms and foundational for training neural networks. LMS is the most widely used adaptive algorithm on the planet, used in some form (for example) by every modem in the world. It was also the first algorithm for training an artificial neuron that used an approximation of the method of steepest descent. Every deep neural network today - with millions, billions, possibly trillions of weights - uses some form of gradient descent for training. LMS is the foundation of backpropagation, and backpropagation is the foundation of modern AI.
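A minimal sketch of the LMS idea (my own illustration with a made-up "channel" to identify, not the book's example): the filter weights are updated in proportion to the error between desired and actual output, which is exactly one stochastic step of steepest descent on the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
true_filter = np.array([0.5, -0.3, 0.2])   # unknown system the filter must learn
x = rng.normal(size=2000)                  # input signal x_n
w = np.zeros(3)                            # adaptive filter weights
mu = 0.01                                  # learning rate

for n in range(3, len(x)):
    xi = x[n-3:n]                # the last three input samples
    d = true_filter @ xi         # desired signal d_n
    y = w @ xi                   # filter output y_n
    e = d - y                    # error e_n, fed back to adapt the weights
    w += mu * e * xi             # the Widrow-Hoff LMS update

print(w.round(3))                # converges toward [0.5, -0.3, 0.2]
```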
Ananthaswamy additionally covers probability in predictive analytics, in both the frequentist and Bayesian variants. Bayes' theorem is used specifically to find the posterior probability: the prior probability updated given the evidence. Maximum likelihood estimation (MLE) is an example of a frequentist ML algorithm, whereas Maximum a Posteriori estimation (MAP) is an example of a Bayesian ML algorithm. MLE works well for a lot of sampled data, whereas MAP works best with less data. As the amount of sampled data grows, MAP and MLE begin converging in their estimate of the underlying distribution. However, the sample size becomes an issue as we add more features - and in real-life ML problems, features can number in the tens, hundreds, thousands, or more. A trick statisticians and probability theorists use to make the problem more tractable is to assume that all features are sampled from their own distributions independently of one another. It's an assumption that makes the mathematics much easier, and is computationally far less intensive. This assumption is called naive Bayes, or idiot Bayes classifier.
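A hedged sketch of the MLE-vs-MAP contrast using coin flips (my own toy example, not the book's): with little data, MAP's prior pulls the estimate toward "fair"; as the sample grows, the two estimates converge.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.7                 # true probability of heads
a, b = 5, 5                  # Beta(5, 5) prior: a belief the coin is roughly fair

for n in [10, 100, 10_000]:
    heads = (rng.random(n) < true_p).sum()
    mle = heads / n                            # frequentist estimate
    map_ = (heads + a - 1) / (n + a + b - 2)   # Bayesian MAP estimate
    print(f"n={n:6d}  MLE={mle:.3f}  MAP={map_:.3f}")  # converge as n grows
```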
The AI winter broke in the 1980s, especially with the 1986 publication of a pathbreaking paper by Rumelhart, Hinton, and Williams on an algorithm called backpropagation - which showed how to train multi-layer (as opposed to single-layer) perceptrons: neural networks with three or more layers (one input layer, one output layer, and one or more hidden layers in-between). It wouldn't be until the 2000s and especially the 2010s that computers would be powerful enough to handle the demands of these multi-layer neural networks, but when they did, a revolution in deep learning was unleashed. (GPUs were a significant contributor to that take-off as well).
The basis of backpropagation is to determine the error made by the network by comparing the produced output with the expected output and then figure out how to change the weights of the network based on the error so that the network produces the correct output. The algorithm makes the sequence of computations from input to loss differentiable at every step, which means that we can compute the gradient of the loss function. Given the gradient, we can update each weight and bias a little bit and thus perform gradient descent until the loss is acceptably minimized. With backpropagation, you can in principle construct a network with any number of layers, any number of neurons per layer: just provide the network with a set of inputs, figure out what the expected output should be, calculate the loss, calculate the gradient of the loss, update the weights/biases, and rinse and repeat.
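To make that loop concrete, here is a tiny two-layer network trained by backpropagation on XOR (my own numpy sketch, not the book's code), the very problem a single-layer perceptron cannot solve: forward pass, loss, gradient via the chain rule, weight update, rinse and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10_000):
    # Forward pass: every step is differentiable.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule carries the error back to every weight.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent update on all weights and biases.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # approaches [0, 1, 1, 0]
```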
Another ML concept, developed by Yann LeCun, is the architecture of the convolutional neural network (CNN), used for grid-like structures and trained using backpropagation. A CNN pools information from many vectors into fewer vectors (illustrated in the book by a 5x5 image grid being turned into a 4x4 grid and then a 3x3 grid). Max pooling reduces the spatial dimensions. The designer of a CNN has to make decisions about hyperparameters that are not learned during the training process, decisions that influence the network's performance: the size and number of kernel filters, the size and number of max pooling filters, the number of convolution and max pooling layers, the size and number of fully connected layers, and the activation functions. "Fine-tuning, or finding the right values for, the hyperparameters is an art unto itself. Crucially, these are not learned via backpropagation" (p. 373).
In ML, each input variable is considered a dimension, and the more complex ML algorithms use thousands or more dimensions (sometimes MANY more). Often, however, much of the variation in high-dimensional data needed to distinguish clusters lies in some lower-dimensional space. Principal Component Analysis (PCA) is used to reduce the data to a tractable number of lower dimensions. If you're having difficulties linearly separating lower-dimensional data, you can do the opposite of PCA: project the data into higher dimensions, sometimes even into an infinite-dimensional space, where there is always some linearly separating hyperplane. This is the basis of the Support Vector Machine (SVM), which uses kernel functions to map data into higher-dimensional spaces.
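A minimal sketch of the PCA half of that paragraph (my own illustration, with synthetic data): center the data, take the eigenvectors of its covariance matrix, and project onto the dominant one.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# 3-D data whose variation lies almost entirely along one direction.
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center each feature
cov = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
top = vecs[:, [np.argmax(vals)]]         # principal component
reduced = Xc @ top                       # 3-D points projected down to 1-D

print(vals.round(3))   # one eigenvalue dominates: most variance on one axis
```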
The SVM algorithm "rocked" the ML community in the 1990s and 2000s, and has been credited with enabling machines to recognize everything from voices, handwriting, and faces to cancer cells. Research in SVMs and kernel methods to some extent derailed research into neural networks until the latter came to dominate ML again from the 2010s onwards. The resulting deep learning revolution happened as researchers started to massively increase the number of hidden layers, using massive amounts of training data and computing power. Today's deep neural networks use billions or even hundreds of billions of weights and tens or even hundreds of hidden layers - and have led to significant improvements in computer vision, natural language processing, machine translation, medical image analysis, pattern detection in financial data, and much more.
Whereas most of the book covers the mathematics behind ML, the final chapter (Chapter 12) and the Epilogue discuss how we've arrived at a point where standard ML theory and our mathematical understanding can no longer adequately explain why today's deep neural networks work so well. This is the concept of grokking, first discovered by OpenAI in 2020. Typically, models have a bias-variance trade-off: the simpler the model, the greater the bias (leading to underfitting, and a higher risk of both training and test errors); the more complex the model, the greater the variance (leading to overfitting, a lower risk of training error, and a higher risk of test error). In models that are too complex, the classification boundary tracks every little deviation in the data, overfitting them: the model does really well on the training data but makes significant classification errors during testing. Theoretically, neural networks should follow the Goldilocks principle, as the best model is one that lands somewhere in the middle of the two.
However, deep neural networks are not behaving like this. Instead, as depicted on a double-descent graph on page 405, as networks become bigger, after initial overfitting, the test errors are minimized again. Deep networks are over-parameterized and should overfit and not generalize well to unseen test data, YET THEY DO. This absence of overfitting in very complex models, or the ability to generalize from the data long after initial overfitting, is what is called grokking, and it's something that standard ML theory and math cannot adequately explain. With grokking, networks do better than could be expected from a model that had simply memorized the training data, seemingly being able to "understand" and reason about the data's rules and patterns at a deeper level. [Note that I do not have the expertise to fully evaluate if the author is overstating the discrepancy from traditional ML and math here, and/or whether grokking might be better understood and explained today.]
"Why Machines Learn" ends with very brief coverage of ChatGPT and related generative AI, stating that one of the most significant developments of the past 5 years is self-supervised learning, a method that takes unlabeled data and creates implicit labels without human involvement and then supervises itself (and is the basis of ChatGPT). This has freed ML from having to use expensive, human-annotated data. The author notes that scaling - using either more parameters or more training data or both - has produced "emergent" behavior, leading to debates on whether LLMs are starting to reason and even model the world. "Researchers are at odds over whether LLMs are actually doing any reasoning or are simply regurgitating text that satisfies the statistical patterns and regularities they encounter in the training data, or even whether there's any meaningful difference between these two ideas" (p. 403). My best guess is that LLMs are more regurgitators than anything, and even the reasoning that is being produced is just a result of taking more time to respond and break things down into steps. But perhaps there's some truth to there being less of a stark difference between the two. It will definitely be interesting to see what the future holds in this regard.
I enjoyed this book. I must say, Anil is an excellent writer and scientific communicator. The book provides a historical and mathematical perspective on the emergence of machine learning. Classical machine learning and deep neural networks are covered and explained clearly, with the underlying mathematics presented lucidly.
However, if you're looking for detailed information on generative AI, you might be slightly disappointed, although it's briefly covered in the final chapter.
Overall, it's a highly enjoyable read, and I would recommend it to anyone interested in machine learning.
4.5🌟 Learned a lot of cool things! Prose is very reminiscent of Microbe Hunters. Wanted more elaboration on the applications of backpropagation. Book feels a little simple, but blends the history and mathematical concepts behind AI seamlessly. Should not have listened to it as an audiobook: a PDF of the equations mentioned is provided, but I'm not pulling up an equation sheet while I'm driving. A must read for anyone using AI (which is everyone, I guess)
Juicy. Brilliant ground-up refresher on linear algebra in the context of ML. For the purpose of developing my own algorithms, the walkthrough of the technology's evolution was invaluable. Beautiful storytelling for an otherwise complex topic.
Comprehensive and well-scaffolded with a non-condescending choice of ZPD that challenges you to keep up. That said, I am not sure why this book needs to exist compared to, for instance, Andrew Ng’s course. Maybe for its attempt to humanize the genre with its biographical interludes, but I just found all that hopes-and-dreams emotional content distracting and, with a little help from my joyless heart, condescending.
Notes: Hebbian learning - neurons that fire together wire together (the connection strengthens when one neuron’s output is consistently involved in the firing of another).
Dot product as the projection (shadow) of one vector on the other; that’s why it is 0 when they are orthogonal.
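In symbols (the standard identity, not a quotation from the book): the projection of a on b has length ||a|| cos θ, so

```latex
\mathbf{a}\cdot\mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta ,
\qquad
\theta = 90^{\circ}\ \Rightarrow\ \cos\theta = 0\ \Rightarrow\ \mathbf{a}\cdot\mathbf{b} = 0 .
```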
There exists a linearly separating hyperplane characterized by weight vector w*; the perceptron’s weight vector w updates in each iteration to fit the data (the angle between w and w* shrinks as the weight vector turns toward the desired w*). But why won’t it keep updating indefinitely? Convergence theorem of Minsky/Papert: the dot product w.w* keeps increasing as the cosine of the angle goes from 0 (orthogonal) to 1 (parallel). But w.w* can also increase merely because the magnitude of w increases. So compare w.w (which increases only with magnitude) against w.w* (which increases with both magnitude and alignment): since w.w* outpaces what magnitude growth alone allows, the angle must shrink, and the updates must stop after a finite number of steps.
Taking partial derivatives of a function with respect to each of its variables, however many there are, means we can express the gradient as a row/column vector.
The old modem sound is the sound of a handshake: two digital devices figuring out the best way to talk to each other over a phone line built for analog voices. An adaptive filter figures out the characteristics of noise in transmissions of ones and zeroes in order to filter it out: a previously agreed-upon signal (d_n) is compared with the output (y_n) after filtering the input (x_n), feeding back the error (e_n).
Mean squared error is better than mean absolute error (just taking the absolute value) because it is differentiable everywhere and punishes extreme outliers.
Even Erdős refused to accept the Monty Hall solution, but Vazsonyi showed with 100k simulations that switching gave twice the chance of winning: 2/3 over 1/3. Probabilities are not static; they change with context.
Monty Hall intuition - change the doors from 3 to a million. Now the host opens a million minus 2 doors, leaving the one you chose and one other mystery door. Obviously you’d choose that one door he’s singled out to not open. Intuition 2 - put a box around your door (1/3 chance) and another box around the unchosen doors (2/3 chance). Now after the host opens one door in the other box, the entire 2/3 chance has shifted to the other door.
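A quick simulation in the spirit of Vazsonyi's (my own sketch; the trial loop's layout is mine):

```python
import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car, choice = random.randrange(3), random.randrange(3)
        # Host opens a door that hides a goat and is not the player's choice.
        opened = next(d for d in range(3) if d != choice and d != car)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(monty_hall(switch=False))   # ~0.333
print(monty_hall(switch=True))    # ~0.667: switching doubles the win rate
```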
Uncertainty over Bayes’ year of birth - was born in 1701 with probability of 0.8.
In a PMF (probability mass function), where the variable takes discrete values, it is possible to talk of the probability of a specific value; but in a PDF (probability density function) for a continuous variable, the probability that it takes some specific precise value is zero. One can only speak of the probability between, say, 98.25 and 98.5 degrees F: the area under the PDF.
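As a worked equation (standard notation, my own example numbers): for a density f over body temperature T,

```latex
P(T = 98.4) = 0 ,
\qquad
P(98.25 \le T \le 98.5) = \int_{98.25}^{98.5} f(t)\,dt .
```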
MLE (maximum likelihood estimation) treats all values of the parameters (e.g., mean, variance) as equally likely and finds the best-fitting distribution, vs. MAP (maximum a posteriori), which uses a prior probability to make assumptions about the parameters before having seen the data.
BOC (Bayes Optimal Classifier) is the best any ML algorithm will ever do: know/estimate the underlying distribution (prior) to calculate the posterior Bayesian probability.
When dealing with a large number of features - the mutual independence assumption - assume that variation in, say, bill depth has nothing to do with bill length, and estimate each distribution independently; not true in nature, but far less computationally intense.
Voronoi cell diagram for Soho cholera pumps (each point inside a cell is closer to its seed than any other)
Theories of vision - intromission (objects emanate bits of matter communicating size, shape, color). Extramission (eyes emanate rays that intercept objects). Alhazen’s theory of light radiating in straight lines from object, enters eyes, faculty of discrimination compares with knowledge or imagination
NN (nearest neighbor) algorithm plots a new point in hyperdimensional vector space, finds the nearest neighbor, and applies the same label to classify the new point. Fix issues of misclassified base data by using an odd number of nearest neighbors (3, 5, etc.), odd so it doesn’t result in a tie. Solves overfitting of a non-linear boundary between labels.
As opposed to BOC, NN makes no assumption about underlying distribution. But non-parametric (no weights for features), uses entire dataset, so not scalable.
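A minimal k-NN sketch (my own illustration; the points and labels are made up): note there are no learned weights, and every prediction consults the entire training set, which is why it doesn't scale.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["circle", "circle", "triangle", "triangle"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))   # "triangle"
```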
The volume of a unit sphere in d dimensions has a formula with the Gamma function in the denominator, which increases much faster than the numerator as d increases; so as the dimensions go to infinity, the volume of the unit sphere goes to 0. But the volume of the unit hypercube is always 1, regardless of the number of dimensions. A 1000-dimension cube has 2^1000 vertices (more than the number of atoms in the universe), but each face (where it touches the hypersphere) is only 1 unit from the origin, meaning most of the space is out at the vertices (like the atom?) - data points end up out near the vertices, roughly equidistant from one another, so k-NN falls apart.
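That shrinking volume is easy to check numerically (my own sketch, using the standard formula V_d = pi^(d/2) / Gamma(d/2 + 1)):

```python
from math import pi, gamma

for d in [1, 2, 3, 5, 10, 20, 100]:
    v = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit d-ball
    print(f"d={d:3d}  volume={v:.2e}")     # peaks near d=5, then collapses
# Meanwhile the unit hypercube's volume stays exactly 1 in every dimension.
```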
Principal Component Analysis PCA projects high-dimensional data onto smaller number of axes.
An eigenvector is characteristic of a matrix: when multiplied by the matrix, it produces the same vector, unchanged in direction but scaled up/down by an eigenvalue. So unit vectors that form a circle, when multiplied by the matrix, transform into an ellipse (stretched and compressed) - the eigenvectors lie along the minor and major axes of the ellipse, perpendicular to each other (when the matrix is square symmetric).
Centering: replace values with mean-corrected values (subtract the expected value). XᵀX then gives a square symmetric matrix where the diagonal captures variance (sum of squared differences) and the off-diagonal captures covariance (sum-product). The eigenvectors of the covariance matrix are the principal components of the original matrix.
Unsupervised learning - only told how many clusters there are, no labels, then it tries to iteratively find the geometric center (centroid) of each cluster.
How to find a linear hyperplane in data with non-linear boundaries: project onto a higher dimension by creating a new feature that is a combination of existing dimensions. Computation problem: a kernel function can compute dot products in the higher-dimensional space without needing to expand each low-dimensional vector into its massive counterpart.
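A minimal illustration of that trick (my own sketch): the degree-2 polynomial kernel (x.z)^2 equals a dot product in an explicitly expanded feature space, without ever building the expanded vectors.

```python
import numpy as np

def phi(v):
    """Explicit feature map matching the kernel k(x, z) = (x.z)^2 in 2-D."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(x) @ phi(z))   # dot product computed in the higher-dim space: 16.0
print((x @ z) ** 2)      # same number from the kernel, no expansion: 16.0
```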
Ferromagnet has aligned magnetic moments of ions, so net magnetism, analogous to crystal structure. If random, then analogous to glass. Hence materials with disordered magnetic moments are called spin glasses.
The Hamiltonian calculates the total energy of the system by summing the products of all spin interactions between neighbors. When spins are aligned, energy decreases; when opposed, energy increases. So alignment is the system trying to reach its lowest energy level.
Associative memory - if stable output represents memory, and distortion takes system into unstable state, then dynamics take over to bring system back to stable state whose output can then be read off to retrieve memory.
Make all neurons talk to all other neurons (like magnetic spins in a lattice), and set symmetric weights for each connection through Hebbian learning (w12 = w21 = y1.y2), or in matrix form W = yᵀ.y - I (subtracting the identity matrix just replaces the 1s along the diagonal with 0s).
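A minimal Hopfield sketch along those lines (my own, with +/-1 neurons and a single stored pattern): build the Hebbian weight matrix, distort the memory, and let the dynamics settle back.

```python
import numpy as np

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])   # the memory; neurons are +/-1
W = np.outer(pattern, pattern).astype(float)       # Hebbian weights w_ij = y_i * y_j
np.fill_diagonal(W, 0)                             # the "- I" step: no self-connections

state = pattern.copy()
state[:3] *= -1                                    # distort 3 of the 8 neurons

for _ in range(5):                                 # synchronous updates (the original
    state = np.where(W @ state >= 0, 1, -1)        # model updates asynchronously)

print(np.array_equal(state, pattern))              # True: the memory is retrieved
```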
Complex cell translation invariant - vertical edge can be anywhere in receptive field, as long as it triggers some simple cell, it triggers complex cell.
Hypercomplex cell rotation invariant - regardless of orientation, the cell will fire as long as there is an edge in the receptive field. Fire maximally for edge of a particular length, not longer or shorter.
Hypercomplex cells combined to form shapes like chevrons, then further for squares/triangles etc.
Convolution: take a 5x5 image, place a 2x2 kernel on the left corner, multiply the overlapping values by the kernel’s entries and sum them, replacing a11; then move the kernel to a12, and so on. Different kernels do different things. The Prewitt kernels highlight horizontal and vertical edges - giving us translational invariance like the visual system of the brain.
Max pooling places kernel and takes highest value pixel under it. Reduces size, number of neurons needed. Increases receptive field (translational invariance).
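A minimal sketch of both operations (my own, pure numpy): slide a kernel over the image and sum for convolution, then take the max in each window for pooling.

```python
import numpy as np

def convolve(image, kernel):
    """Valid convolution: slide the kernel, multiply overlapping entries, sum."""
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

def max_pool(image, size=2):
    """Keep only the largest value in each size-x-size window."""
    h, w = image.shape[0] // size, image.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = image[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.zeros((5, 5)); image[:, 2:] = 1.0      # vertical edge at column 2
prewitt_v = np.array([[-1, 0, 1]] * 3)            # Prewitt vertical-edge kernel
edges = convolve(image, prewitt_v)                # strong response along the edge
print(edges)
print(max_pool(np.abs(edges)))                    # smaller map, edge still detected
```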
Grokking is from stranger in a strange land - not just understanding but internalizing and becoming the information. Like a phase change from water to ice - go from memorized table of answers to becoming knowledge.
The standard bias-variance curve is U-shaped: under-parameterized with high bias (underfitting), dropping to a minimum, and then rising again once the model has interpolated (overfitting, taking the noise seriously). But modern ML starts from the right of this curve, falling from a maximum at the interpolation threshold to a minimum that keeps dropping.
An LLM’s prediction is just a conditional probability distribution. Predicting the next word of ‘the dog ate my...’ means comparing P(cat | the, dog, ate, my) vs. P(biscuit | the, dog, ate, my), etc.
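A toy version of that idea (my own sketch with a made-up three-sentence corpus): estimate next-word probabilities from bigram counts; an LLM models the same kind of conditional distribution with a neural network over much longer contexts.

```python
from collections import Counter, defaultdict

corpus = ("the dog ate my biscuit . the dog ate my homework . "
          "the cat ate my biscuit .").split()

following = defaultdict(Counter)          # next-word counts for each word
for w, nxt in zip(corpus, corpus[1:]):
    following[w][nxt] += 1

counts = following["my"]
total = sum(counts.values())
for word, c in counts.items():
    print(f"P({word} | ..., my) = {c}/{total}")   # biscuit 2/3, homework 1/3
```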
The basic problem the brain has to solve is the same as backpropagation’s - credit assignment, i.e., how to assign blame for the error to each of the network’s parameters.
This book gives an overview of the math that powers Deep Learning algorithms. I was going to say high level, but that's not quite right: it goes into great depth, describing the concepts in lay terms. This book will give you an intuition for thinking about large language models (LLMs) like ChatGPT and such. I love that the author gives clear explanations of complex topics. I really enjoyed getting the history and the people behind each discovery; the book goes in chronological order. I have been studying and following AI since it was called "Pattern Recognition" in grad school; I took courses on "Machine Learning" at the University of Washington about 10 years ago, right around the time Deep Nets were starting to gain traction. I was starting to think I was not going to get anything new from this book, but the last chapter covers what's been happening since 2020 and it made me change my mind about this field. I don't want to "spoil" it, but I didn't know about the fact that Neural Networks are going past the theoretical bias/variance "Goldilocks" limit. Reading about that made me take another look at the tech. Like I said, I've been skeptical about NNs and the "hype" of AI, but reading about this made me want to know more.
"Why Machines Learn" is a unique perspective on the evolution of machine learning, making it a valuable resource for a specific audience. While the book garnered a 5-star rating from this reviewer, it's important to understand its strengths and weaknesses to determine if it's the right fit for people picking up the book.
The book's core strength lies in its comprehensive exploration of the mathematical foundations of machine learning, particularly in the pre-deep learning era. For those unfamiliar with the field, "machine learning" (ML) is a subfield of artificial intelligence that enables computers to parse data rigorously and far more elaborately than regressions and other simpler linear statistical methods that are grossly inadequate in practical life. This book fearlessly dives into the mathematical underpinnings of various ML algorithms, providing a reasonably accessible journey through the concepts for those well-initiated. It adequately and diligently covers a significant period, from the late '60s until the arrival of Convolutional Neural Networks (CNNs). It works through the key phases as defined by the arrival of major concepts like Principal Component Analysis, Support Vector Machines, Backpropagation, Gradient Descent, and the like.
However, the book's mathematical depth can be a double-edged sword. Readers far removed from the field might find the heavy emphasis on mathematical formalism overwhelming. Conversely, experts deeply entrenched in modern machine learning may find much of the content too basic. The book's brevity also means it dedicates limited space to newer, dominant methods.
The book hits a sweet spot for a specific group like this reviewer. Those with a solid understanding of pre-transformer ML concepts will find it a refreshing and insightful read. It allows for a rapid yet thorough review of familiar territory, offering new perspectives by presenting these concepts in close proximity. In some ways, and this is in the language of this reviewer, Machine Learning is about optimizing and squeezing "degrees of freedom" (DoF) to analyze data without overuse. In ML terms, it means combining concepts like Principal Component Analysis (PCA), which reduces DoF, with techniques like Kernel methods, which increase DoF, and Support Vector Machines (SVMs), which aim to find optimal solutions within this complex space, to name some of the concepts from the book.
A key challenge in traditional machine learning was balancing the trade-off between overfitting and underfitting. Overfitting, where a model memorizes data instead of learning general patterns, led to poor generalization, while underfitting, where a model was too simple to capture meaningful trends, reduced its predictive power. The appropriate term for underfitting in this context is underparameterization, which describes a model that lacks sufficient flexibility to capture complex patterns in the data. Classical models addressed this challenge through regularization techniques, dimensionality reduction, and careful model complexity selection. These strategies ensured that models remained interpretable and grounded in statistical rigor, balancing variance and bias. The book shines in tracing how these methods, rooted in a disciplined, almost statistical approach, sought to find the "best" hyperplane to categorize data.
Before the rise of neural networks, machine learning was deeply rooted in mathematical and statistical formulations. All these methods were built on deterministic optimization techniques with clear mathematical foundations. While the terminology used in statistics differed from machine learning, the core principles remained aligned, focusing on structured ways to represent data and optimize models. Techniques like kernel methods enabled transformations that made complex data separable, and optimization-driven classifiers such as SVMs sought to maximize decision boundaries with rigorous mathematical constraints. The reliance on well-defined equations and statistical properties made these approaches fundamentally different from the heuristic-driven methods that would later emerge with deep learning. The book is primarily about the disciplined era of ML.
The book's biggest missed opportunity lies in its extremely light treatment of the shift towards more heuristic approaches in modern machine learning, specifically the deep learning revolution. Neural networks, especially deep neural networks, challenged the traditional "variance-bias tradeoff." The emergence of neural networks, particularly deep learning, introduced a fundamental shift in how models were designed and trained. On the surface, increasing the number of layers in a neural network seemed analogous to adding kernel transformations in traditional ML, as both approaches expanded the feature space and allowed for greater model expressiveness. However, neural networks diverged sharply due to their reliance on heuristics and stochastic processes. Instead of deterministic optimization, they employed randomly initialized weights, stochastic gradient descent, and non-convex loss functions, making their learning process far less predictable. Unlike classical models where solutions were mathematically determined, neural networks produced different outcomes depending on initialization and training conditions, fundamentally changing how learning was approached.
What the book does not discuss in sufficient detail is that the role of randomization in deep learning further distinguished it from earlier machine learning methods. Weight initialization was no longer based on deterministic rules but rather on probabilistic methods, leading to different learning trajectories even for the same dataset. Training deep networks required iterative adjustments using mini-batches of data rather than full-batch optimizations, introducing additional uncertainty. Techniques like dropout and batch normalization, which had no direct analogs in traditional ML, emerged as necessary adjustments to stabilize learning. While these approaches improved performance, they lacked the formal mathematical justifications that defined classical ML, making deep learning more of an experimental science than a strictly theoretical discipline. The book’s clear bias towards theoretical justifications made it go extremely light on these aspects of the methods important today.
The following is covered in the book, but relatively in passing compared to the space offered to pre-NN concepts and compared to the importance of these phenomena. One of the most surprising aspects of deep learning was its ability to disrupt the classical variance-bias tradeoff. In traditional models, increasing complexity inevitably led to overfitting, but deep networks, even with billions of parameters, often generalized better than simpler models. This phenomenon, known as double descent, contradicted long-standing assumptions in statistical learning theory. It suggested that adding parameters improved rather than harmed performance beyond a certain threshold, a counterintuitive result that remains only partially understood. Unlike earlier ML models where generalization was explicitly controlled through mathematical constraints, deep learning seemed to achieve it through emergent properties (not everyone’s favorite term), raising questions about the underlying mechanisms driving its success.
Deep learning neural networks are highly complex models that defy description. This is perhaps the reason behind their light treatment. They work in ways that are not yet fully understood by theoreticians. The book mentions the defiance of the tradeoff, including concepts like double descent, but could have covered the latest neural architectures far more than with mere mentions. These almost post-ML methods rule the world of AI today, but not in the book. While their mysterious workings are acknowledged, the book could have covered at least the attention formulas in a few chapters, given their importance.
Despite this weakness, this is a great book for at least certain types of enthusiasts.
A fascinating book that looks at the history of Machine Learning (ML) to show how we arrived at the machine learning models we have today that drive applications like ChatGPT and others. Mathematics involving algebra, vectors, matrices, and so on features in the book. By going through the maths, the reader gets an appreciation of how ML systems go about the task of learning to distinguish between inputs to provide the (hopefully) correct output.
The book starts with the earliest type of ML, the perceptron, which can learn to separate data into categories and started the initial hype over learning machines. The maths are also provided to show how, by adjusting the weights assigned to its training inputs, the machine discovers the correct weights that allow it to categorize other inputs.
Other chapters then show other ways to train a machine to categorize its input, based on Bayes' Theorem and nearest neighbour. They have their advantages and disadvantages: choosing the right (or wrong) way to train a machine will have an impact on how well the machine can categorize its data.
Matrix manipulation, eigenvalues and eigenvectors are then introduced. When there are many input parameters, it can be hard to categorize them based on all the factors. By using eigenvalues and eigenvectors, it is possible to discover which factors cause the most variation among the data, and thus categorize them. And, in an interesting reversal, it is also possible to manipulate the input by projecting it into more dimensions, which can reveal patterns that can then be used to categorize the input.
These ML models categorize input data using one level of 'neurons'. The next step would be to introduce a 'hidden layer' of neurons that can be used to combine the incoming data in many ways, which provides new ways to manipulate the data for categorization. This would provide a boost in the abilities of machines to recognize input data.
Lastly, the book catches up to current-day ML models, which feature a huge increase in the number of hidden layers and weights used to manipulate input data. The book then points out that this huge increase has caused the theory of how machines learn to fall behind: the machines now exhibit abilities that theory cannot account for. The ability of such machines to pick out patterns in data through self-learning, rather than being pre-fed known data, is also an unexpected feature that current ML theories cannot explain.
These unaccounted-for features of current-day ML systems are a probable cause for concern. So too is the concern over the kind of data being pre-fed to the systems: data that comes with various biases that only cause the system to make yet more biased decisions. Until we know better how these systems behave, it would be best to treat their outputs with caution.
I love nonfiction books that simplify concepts without dumbing them down beyond recognition. The point of a popular science book is to inform a more general audience about a specific science. Popular science should clearly convey the consensus and the conflicts in that science. Why? Because anyone not deeply familiar with the field will take the popular science writings as gospel. Anil does this wonderfully! We follow along different strands of the field as they each attempt to tackle the same or similar problems, we see their successes and discover their limitations.
I really enjoy when authors teach us the tools needed to "discover" the phenomenon at hand. For example, Anil takes several detours into matrix algebra and calculus to provide us with the tools necessary to understand the inner workings of popular ML algorithms.
The historical accounts, quotes from interviews with researchers, as well as mathematical notation coupled with the illustrations and pedagogical explanations in text, makes this book accessible to anyone with high-school math. Even without high-school math, if you're willing to take your time to learn some math through this book, I think you'll thoroughly enjoy it.
Having said that, I could only read this book when I had a couple of uninterrupted hours in a row (like taking a 3 hour train ride). I don't think it's very easy to pick up where you left off if it's been too long between your reading sessions. This book is in the same spirit as other books that I absolutely loved: "When Least is Best" (Nahin) and "The Biggest Ideas in the Universe #1" (Carroll), and to some extent "Chaos" (Gleick).
A popular-science history of the mathematics of artificial intelligence, from the 1950s to the present day. My favorite combination: clear explanations of how everything works, plus curious stories from the lives and work of the scientists. I didn't know, by the way, that many of the scientists who shaped the foundations of modern neural networks are of Soviet and Russian origin.
The mathematics is explained in detail, with vectors and matrices, but in quite human terms: I finally understood how backpropagation works in neural networks, and became convinced that nobody understands why deep learning works. I even felt a little sorry that back in the 2000s, at our department of intelligent systems at MSU, neural networks weren't taught in this style.
Quotes:
«I found myself repeating ideas and concepts over the course of writing this book, sometimes using the same phrasing or, at times, a different take on the same concept. These repetitions and rephrasings are intentional: They are one way that most of us who are not mathematicians or practitioners of ML can come to grips with a paradoxically simple yet complex subject.
I hope your neurons enjoy this process as much as mine did. »
«There’s delicious irony in the uncertainty over Thomas Bayes’s year of birth. It’s been said that he was “born in 1701 with probability 0.8.”»
«No parent duck sits around labeling the data for its ducklings, and yet the babies learn. How do they do it? Spoiler alert: We don’t know, but maybe by understanding why machines learn, we can one day fully understand how ducklings and, indeed, humans learn.»
«Hinton, the final speaker at the bootleg session, started with a quip: “So, about a year ago, I came home to dinner, and I said, ‘I think I finally figured out how the brain works,’ and my fifteen-year-old daughter said, ‘Oh, Daddy, not again.’”»
«A team member who was training the neural network went on vacation and forgot to stop the training algorithm. When he came back, he found to his astonishment that the neural network had learned a general form of addition. It’s as if it had understood something deeper about the problem than simply memorizing answers for the sets of numbers on which it was being trained.»
«At the time, Wiener was MIT’s best-known professor. Decades later, Widrow, recalling Wiener’s personality in a book, painted a particularly evocative picture of a man whose head was often, literally and metaphorically, “in the clouds” as he walked the corridors of MIT buildings: “We’d see him there and he always had a cigar. He’d be walking down the hallway, puffing on the cigar, and the cigar was at angle theta—45 degrees above the ground. And he never looked where he was walking…
Even as he approached the steps at the end of some hallway, Wiener would be looking up, not down. “You can see he’s going to kill himself—he’s going to fall down those steps—but if you disturb him, you might break his train of thought and set science back like ten years! There was always that problem.”»
«For Hinton, the United States was a revelation after the academic “monoculture” of Britain, where there was the right way to do things and where everything else was considered heresy. Neural networks constituted heresy. “And the U.S. is bigger than that. In particular, it’s got two coasts. They can each be heresy to the other,” Hinton said.»
«There’s a very important and interesting question about whether biological brains do backpropagation. The algorithm is considered biologically implausible, precisely because it needs to store the entire weight matrix used during the forward pass; no one knows how an immensely large biological neural network would keep such weight matrices in memory. It’s very likely that our brains are implementing a different learning algorithm.»
«Deep nets have way too many parameters relative to the instances of training data: They are said to be over-parameterized; they should overfit and should not generalize well to unseen test data. Yet they do. Standard ML theory can no longer adequately explain why deep neural networks work so well.»
«Mikhail Belkin of the University of California, San Diego, thinks that deep neural networks are pointing us toward a more comprehensive theory of machine learning. He likens the situation in ML research to the time in physics when quantum mechanics came of age. “Everything went out of the window,” he said.»
Dr. Ananthaswamy does a terrific job of explaining the underlying mathematical principles that guide modern machine learning. He ensures that the reader can follow along with the developments of machine learning from a historical point of view and explains the logic and the problems that the scientists were attempting to solve at these points in time. As the title suggests, an enormous focus is placed on the mathematics: from explaining how to calculate a gradient, to the chain rule in calculus, to Bayes' Theorem in statistics, Dr. Ananthaswamy does a great job of focusing on some of the most important core mathematical ideas that guide modern machine learning.
Personally, I have a master's degree in mechanical engineering, so my degree already included a lot of linear algebra and calculus, which is fundamental to understanding this book and machine learning in general. Dr. Ananthaswamy tries his best to make this accessible to everyone, and I think for the most part he succeeded. I would recommend this book to anyone, but I think those who will have the easiest time with it are those who have taken at least one course in linear algebra in college or taught themselves the same principles. For those who have not experienced this level of math, I still believe that there is a lot of use in this book, and he really does his best to explain every main concept. If you have at least a basic algebraic understanding of mathematics (such as what y = mx + b means), then you really can understand what he is speaking about. He tries to segregate some of the more "mathy" sections into codas located at the end of each chapter.
Personally, I took notes while reading this book. It is not meant to be a textbook; however, I think to get the most out of it, it would be wise to take notes and follow along with the math problems he provides in the text (per chapter there aren't too many; he will generally have two examples to show a concept and then apply that concept to a hypothetical machine learning problem).
I have been trying to understand machine learning since 2018, and I think this book truly illuminated what is happening at a fundamental level and the logic of why certain approaches are the way they are. I would recommend this book to all people; it is certainly challenging but not impossible, and he really tries his best to make it as easy as possible to understand. The ultimate goal of this book is not to just show all the equations but rather to provide a starting point for developing one's intuition. If one is debating pursuing an advanced degree in this topic, I think this would be essential to see if this subject is really for you, because from what I have heard, most of the work will be math problems before any hands-on work like building your own neural network. In any case, I highly recommend reading this book to anyone who is interested in the field or is curious to see how applied mathematics can assume one form. This book is truly for all.
I approached "Why Machines Learn" with an appreciation for the intricate dance between theoretical concepts and their practical applications. It delves deeply into the mathematical frameworks that have driven the remarkable advancements in ML and AI.
Ananthaswamy excels in breaking down complex mathematical ideas into digestible segments. From Rosenblatt's perceptrons to contemporary deep neural networks, the book navigates through decades of developments with clarity. The author's ability to explain linear algebra, calculus, and other foundational mathematical concepts is commendable, making these subjects accessible even to those without an extensive background in mathematics. This is crucial for a broader audience to appreciate the profound implications of these algorithms.
What I found to be one of the book's strongest points is its integration of the social and historical contexts within which these mathematical advancements occurred. By weaving narratives of key figures in AI, such as Geoffrey Hinton and others, Ananthaswamy provides a richer, more nuanced understanding of how these technologies evolved. This approach not only humanises the scientific endeavour but also highlights the collaborative nature of scientific progress.
The book does not shy away from discussing the real-world applications and ethical dilemmas posed by ML systems. Ananthaswamy explores how these algorithms impact critical areas like medical diagnostics, financial decisions, and criminal justice, prompting readers to consider both the capabilities and the limitations of AI. This balanced perspective is essential in an era where AI is increasingly intertwined with everyday life.
While "Why Machines Learn" is highly informative, there are areas where it could delve deeper. For instance, the mathematical discussions, while clear, sometimes gloss over the more intricate proofs and derivations that a mathematically sophisticated audience might crave. Including appendices or supplementary sections with detailed mathematical treatments could enhance the book's appeal to a technically proficient readership.
Additionally, the book could benefit from a more thorough exploration of cutting-edge topics such as quantum ML and the implications of AI for theoretical physics. Given the rapid pace of advancements in AI, a forward-looking chapter on speculative developments and future trends would have been a valuable addition.
Why Machines Learn is a fascinating, thorough, and semi-accessible monograph on AI and how machine learning algorithms are changing -everything- in modern society, meticulously delineated by Anil Ananthaswamy. Due out 16th July 2024 from Penguin Random House on their Dutton imprint, it's 480 pages and available in hardcover, audio, and ebook formats. It's worth noting that the ebook format has a handy interactive table of contents as well as interactive links throughout. The ebook format's search function is also helpful for finding info and references (recommended).
The author does a very good job covering the background and development of machine learning and relates a lot of the human history and key players from the 50s to the current day. It's a timeline with an ever increasing pace and he draws a clear line from the creeping forward progress to the hurtling (scary) pace of the current day.
Oddly enough, the author doesn't speculate about the near (or far) future of AI and machine learning, and his insights would've been a valuable addition to the book. The specific mathematics included in the book were at an odd level as well; too simple for people conversant with the material, and probably too complex for non-math-inclined laypeople.
Although written in mostly accessible, non-rigorous language, it's meticulously annotated throughout, and the chapter notes will provide readers with many hours of further reading.
Four stars. Definitely a niche book, but well written and interesting. It would be an excellent choice for public library or post-secondary library acquisition, for home use, or possibly as adjunct/support text for more formal math/science history classes at the post-secondary level.
Disclosure: I received an ARC at no cost from the author/publisher for review purposes
Why Machines Learn delivers on what the title promises, explaining how AI is built on some basic math, including Probability, Statistics, Linear Algebra, and Geometry, and how computers use that math to “learn.” The book starts by talking about very basic Machine Learning (which, in its simplest form, is Linear Regression) and walks you through the history of how we learned to use computers to build models from data, and goes on to discuss machine vision and LLMs. You will learn how ML and AI systems work, as well as their limitations and why they exist. The book wraps up with an epilogue about the limits and risks of AI, including the errors that come from biases in the underlying training data and models.
The book has significantly more math than you might expect from a non-textbook; this is good if you want to go beyond the superficial. However, it could be intimidating if you’re unfamiliar with the notation. Ananthaswamy does an excellent job of guiding you through it, explaining the notation and the results; even if you skim certain parts, you will still understand the essential results and history of AI. If you aren’t in the mood for scanning through equations and math at all, you might want to skip this book, but if you feel comfortable with some of the basic linear algebra and are willing to treat some of the details as abstractions, it’s a worthwhile read.
The discussion at the end of the book about bias and error talks about how LLMs confidently make statements even though they can have errors due to biases in their training data. While this is true and should be considered when evaluating results, it’s also worth noting that humans are subject to the same challenges. I would have loved a discussion about whether machines are a bigger risk because we assign them more credibility or neutrality.
This is a great book for someone who wants to know more about the underlying science and technology behind Machine Learning and AI, but who doesn’t feel the immediate need to build their own models and programs.
4.5/5 (very few books get 5/5, but this one is close)
I've read a good many books about AI. Some tackle the history, some address the varieties of AI, some talk about the societal impact of AI... This one is a gentle introduction to the math behind AI. If you remember your senior high school year, or junior university math, then you can probably follow this book. If things like matrices, dot products, calculus, the chain rule, probability, and Bayes' rule don't ring a bell, you may have a tough time.
Despite the complex subject matter, the author strives to make the content understandable for a general audience, using clear explanations and illustrative examples. Lots of diagrams, but also formulas (you've been warned). The book draws connections between machine learning and other scientific disciplines, including neuroscience and physics.
You'll gain some insight into how advancements in computer science and hardware have fueled the recent explosion in AI capabilities. It's one thing to come up with the theories, but you actually have to have enough data, and enough computing capacity, to test things out. Contrary to what some may expect, sometimes the theory doesn't align with what we observe. This is why it's appropriate to view AI as legitimate computer science. (Keep this in mind when you hear "oracles" like Sam Altman talk about AGI; time will tell.)
The author addresses the current limitations of machine learning theories and the challenges posed by rapidly advancing AI systems. Some sophisticated chatbots seem as if they are sentient, but they aren't; frankly it is quite amazing to see the emergent properties of some of the big LLMs, but they aren't actually disembodied intelligences. Understanding the underlying math helps you to see why. This is important so we can correctly conceptualize AI's capabilities and limitations as it becomes increasingly integrated into our daily lives.
Anil Ananthaswamy (http://anilananthaswamy.com) is the author of more than five books. Why Machines Learn: The Elegant Math Behind Modern AI was published a few days ago. It is the 55th book I completed reading in 2024.
I received an ARC of this book through https://www.netgalley.com with the expectation of a fair and honest review. Opinions expressed here are unbiased and entirely my own! I categorize this book as G.
The author covers the historical evolution of Machine Learning (ML). He covers the conception of artificial neurons and how they have been used. He discusses Bayesian reasoning and how important matrices are to ML. He proceeds to explain support vector machines. A great deal of time is spent on artificial neural networks: how they work, and how they are trained. Backpropagation is explained, and examples are given. Ananthaswamy describes how neural networks can be applied to image recognition and delves into deep neural networks.
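For readers new to the term, here is a minimal sketch (my illustration, not one of the book's worked examples) of what backpropagation amounts to: applying the chain rule layer by layer in a tiny two-layer network and nudging the weights downhill:

```python
# A tiny illustration (mine, not the book's) of backpropagation:
# a two-layer network y = w2 * tanh(w1 * x), trained via the chain rule.
import math

w1, w2 = 0.5, 0.5          # initial weights
x, target = 1.0, 0.8       # one training example
lr = 0.1                   # learning rate

for step in range(200):
    # Forward pass
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2

    # Backward pass: propagate d(loss)/dy back through each layer.
    dy = 2 * (y - target)            # d(loss)/dy
    dw2 = dy * h                     # d(loss)/dw2
    dh = dy * w2                     # d(loss)/dh
    dw1 = dh * (1 - h * h) * x       # tanh'(z) = 1 - tanh(z)^2

    w1 -= lr * dw1                   # gradient descent update
    w2 -= lr * dw2

print(f"final loss {loss:.6f}")      # should be near zero
```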
I enjoyed the 11+ hours I spent reading this 476-page book on technology. I have read many books on technology, though this is the first I have written a review for. I picked this book from NetGalley because I am interested in AI. I taught a course on AI one semester when I was in the Computer Sciences faculty of St. Edward’s University and wanted to see how much had changed. While the author includes considerable math in the book, it does not get in the way of understanding the material. After reading this book I certainly now have a better understanding of how the current plethora of Large Language Model systems work. I like the chosen cover art. I give this book a rating of 4 out of 5.
I have to admit I was expecting a bit more from this book. I appreciate when journalists dig into technology topics and try to bring digested information to a wider public. But for me, it is missing deeper knowledge. For example, in the chapter about eigenvectors/eigenvalues, which is not an easy topic, Anil mentions the cool idea of visualizing what a 2x2 matrix does to vectors by taking unit vectors distributed around a circle and showing how they are transformed when multiplied by that matrix. Then he picks two of those transformed vectors and says they are the eigenvectors, without any explanation of why he picked exactly those and not the others.
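For what it's worth, the gap the reviewer points at is easy to fill with a few lines of code. Here is a minimal sketch (mine, not the book's; the matrix is an arbitrary example) that scans directions around the circle and keeps only the ones the matrix merely scales, then cross-checks against numpy:

```python
# A minimal sketch (not from the book) of the visualization described
# above: apply a 2x2 matrix to unit vectors around the circle and find
# which directions are merely scaled -- those are the eigenvectors.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # example symmetric matrix; eigenvectors are real

angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
for theta in angles:
    v = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    Av = A @ v
    # If A merely scales v, the 2D cross product of v and Av is (near) zero.
    # Each eigen-direction shows up twice (as v and as -v).
    if abs(v[0] * Av[1] - v[1] * Av[0]) < 1e-9:
        print(f"eigen-direction at {np.degrees(theta):5.1f} deg, scale {Av @ v:.2f}")

# Cross-check with the standard routine:
eigvals, eigvecs = np.linalg.eig(A)
print("eigenvalues:", eigvals)
print("eigenvectors (columns):\n", eigvecs)
```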
It would also have been nice (in my opinion), since neural networks (NNs) are basically about finding the weights for the connections between neurons, to show one simple example with concrete weights and how it solves a particular problem: for example, training a NN to recognize whether there are more black pixels on a 3x3 grid whose pixels are only white or black. I am just making this up, but I am sure there are simple examples where the whole NN, with all layers and weights, can be depicted, as in the sketch below. I thought this would be more instructive than outlining how to use a NN to recognize digits written on a 28x28 grid (which is also cool).
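Since that 3x3 example is hypothetical, here is one way it could look: a single neuron whose nine weights and one bias can be written out in full (my construction, not the book's):

```python
# A toy instance of the hypothetical example above (mine, not the
# book's): a single neuron that says whether a 3x3 black/white grid
# has more black pixels than white. Every weight is 1 and the bias
# is -4.5, so the entire network can be written down completely.
def more_black(grid):  # grid: 9 values, 1 = black, 0 = white
    weights = [1.0] * 9          # one weight per pixel
    bias = -4.5                  # fires only if 5+ pixels are black
    activation = sum(w * x for w, x in zip(weights, grid)) + bias
    return activation > 0        # step activation

print(more_black([1, 1, 1, 1, 1, 0, 0, 0, 0]))  # True: 5 black pixels
print(more_black([1, 1, 0, 0, 1, 0, 0, 0, 0]))  # False: 3 black pixels
```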
Overall, I was quite disappointed by this book. But that is just me: I have an MS degree in math/physics, so maybe my expectations are higher than those of readers who just want a basic intro to this topic.
A bit on the fence on this one. I wanted to love it, given the reviews and my career in the subject. I found the first few chapters a bit dull; it was more of a history lesson, and I thought the math could have been explained better. It's really hard to describe high-dimensional geometry that changes over time in a book, even an eBook. I so want new books to have animations and be more interactive. Sorry, this is my pedagogical pet peeve: so much is possible with new ways to teach, yet we still rely on centuries-old technology.
The last 4 chapters, though, were outstanding. The history now made some sense, and I could ignore the math lessons. I liked that the author brought up the following:
- Artificial neural nets have some likeness to biological neural nets, especially convolutional neural nets and their corresponding structure in the visual cortex.
- Clearly the way neural nets learn is not biologically feasible: biological neurons are not computing partial derivatives, and we don't need to see a million different cats to recognize what a cat looks like.
- Neural nets outperform what theory predicts; empirically they do things that seem impossible given the architecture.
- The discussion of LLMs was some of the best out there.
I especially like that the author built a neural net in order to really understand how it worked. Indeed the best way to learn is just to do.
This is an extremely technical book about how machines learn, not really why they learn. The why is simple: they learn because we programmed them to learn from our scripts; we made them to do this. The how is what's actually fascinating, covering the mathematics and the data sets, recursive systems, but mostly parsing and comparison. It ends up very similar to analyzing how people think. People say, "hey, they don't think like people," but actually, quite often they do. That's why large language models specifically can get you in trouble: they are the accumulation of masses of human thought, so if you say there's a problem, the problem comes from within. People are scary. Who knew? I knew. My complaint is that there's so much moral posturing with this stuff, especially right at the end. There was a lot of moral posturing, and I hated it. We had an excellent technical book here, and I probably rated it lower than I otherwise would have because of the moral posturing at the end.
Straight up: people intentionally make more biased decisions against one another than an LLM will accidentally.
This is a nice mathematical-history work centered on learning in neural networks.
It traces a path from algebra to geometry into calculus, then into multi-dimensional, multivariate differential equations, and finally into algorithm-driven statistics and machine learning, with each step of the history connected back to the learning neural networks of the 1960s and explanations of why we had the AI winters that preceded LLMs.
PS: There are many more proofs than strictly needed, but the author clearly loves the math.
The book tracks these developments from the 17th century to the modern day, with a special focus on learning systems and examples of how current data scientists approach machine learning, ending with a last-chapter discussion of LLMs: how they use some of the previously explained math methods, and the current inability of ML theorists to mathematically explain the surprisingly low overfitting and training loss of LLMs like ChatGPT.
I found it a fun read, as I enjoyed remembering taking all these mathematics classes as an undergraduate, and I appreciated the historical notes; I'm not sure a non-engineer would feel the same. Personally, with only one chapter at the end on LLMs, I was disappointed and hungry for much more discussion of the state of the art. As such, the book is more a math primer than an AI/LLM tour de force.
An Accessible and Beautifully Written Journey Through the Mathematics of AI
Anil Ananthaswamy has done something truly special with Why Machines Learn. In a field often dominated by jargon and overwhelming technicality, he offers a remarkably elegant and readable exploration of the mathematical principles that underpin modern artificial intelligence. This book doesn’t just explain what machine learning is — it illuminates why it works, and it does so with clarity, depth, and a journalist’s gift for storytelling.
What sets this book apart is its rare ability to blend rigorous concepts with intuitive explanations. Ananthaswamy takes readers through linear algebra, probability theory, optimization, and other foundational tools, not in isolation, but as they come alive within real-world AI applications. Whether he’s explaining how gradient descent mimics nature or demystifying neural networks, he makes complex ideas feel surprisingly accessible.
This is not a textbook, and it’s not just for data scientists — it’s for anyone curious about the logic that powers today’s intelligent systems. If you’ve ever wanted to understand the beauty behind the algorithms shaping our world, this book is a must-read.
Highly recommended for tech enthusiasts, students, and lifelong learners alike.
An interesting book if you want to start diving into the weeds of how many of today's recommendation and search functionalities actually work and don't have any data science background. The author very much starts from first principles and has lots of useful diagrams for those unfamiliar with the math, but the subject matter necessitates a pretty rapid ramp into complicated vector math and proofs. It's kind of a strange balance. It might be a little overwhelming for those not super technically inclined, but hey, math is right there in the subtitle. This isn't a casual pop-sci primer about machine learning or artificial intelligence, if that's what you're looking for.
We never really get into the "why" of ML or AI; it's much more the "how".
A lot of the current discussion of machine learning is dominated by large language models, though many applications just need other types of ML, and this book really focuses on those others. It'll give you a foundation and get you comfortable with the ideas before you dive into deep learning or LLMs if you wish, though I'd consider those the more "modern" topics, as the field has already moved on quickly.
If you are already familiar with some of the mathematics involved -- Bayesian inference, linear algebra, optimization -- this book provides a great explanation of how the "black box" of machine learning can actually be expressed in math that has been around for a long time.
If all that math is new to you, this book might not be sufficient on its own to teach it to you, and THEN to explain how it is used in machine learning.
Still, it's a super valuable book. It moved me from being a skeptic of woo-woo ML to understanding the basis for much of it.
The most interesting part was the explanation of the lack of consensus around why adding more parameters past the point of overfitting actually results in better generalization. That's exciting -- there is clearly new theory out there waiting to be discovered.
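For context, this phenomenon is often called "double descent." The sketch below (my construction, not the book's; data, widths, and seed are made up for illustration) shows the kind of experiment that exhibits it: minimum-norm least-squares fits on random features of growing width, with test error tracked past the interpolation point:

```python
# A sketch (mine, not the book's) of the "more parameters can help"
# puzzle: fit random-feature regressions of growing width with
# minimum-norm least squares and track test error. In runs of this
# kind, test error often spikes near width ~= n_train (the
# interpolation threshold) and then falls again -- "double descent".
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 200, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ np.ones(d)) + 0.1 * rng.normal(size=n)
    return X, y

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)

for width in [5, 20, 40, 80, 320, 1280]:
    W = rng.normal(size=(d, width))           # fixed random features
    Ftr, Fte = np.tanh(Xtr @ W), np.tanh(Xte @ W)
    beta = np.linalg.pinv(Ftr) @ ytr          # minimum-norm solution
    err = np.mean((Fte @ beta - yte) ** 2)
    print(f"width {width:5d}  test MSE {err:.3f}")
```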
The worst part was when the author rejected conventional mathematical notation in favor of his own, much worse, notation. Subscripts are good, actually, and he should use them. And hats on variables should be used for expected values.