# Why Machines Learn

**Anil Ananthaswamy** | [[Numbers]]

![rw-book-cover](https://m.media-amazon.com/images/I/71aqypviIzL._SY160.jpg)

---

> "A computer, then, is a dynamical system, one whose behavior can be seen as evolving, or transitioning from state to state, with each tick of the clock."

This is the mathematics of machine learning made accessible. Not dumbed down—genuinely explained. Ananthaswamy traces the conceptual history from Rosenblatt's perceptron through support vector machines to deep learning, revealing the mathematical ideas that make each approach work.

The core insight: machine learning is fundamentally about finding boundaries in high-dimensional space. Whether you're classifying images, predicting outcomes, or generating text, the underlying mathematics involves finding hyperplanes that separate categories of data—and the cleverness is in how you find them.

---

## Core Ideas

### [[The Perceptron and Linear Separation]]

The perceptron, devised by Frank Rosenblatt in the late 1950s, finds a hyperplane that separates data into two categories. A hyperplane is a generalisation of a line (in 2D) or a plane (in 3D) to higher dimensions. If your data can be cleanly divided by such a boundary, the perceptron will find it—guaranteed by the perceptron convergence theorem.

The problem: most interesting data isn't linearly separable. You can't draw a straight line between cats and dogs in pixel space. This limitation, proved elegantly by Minsky and Papert in 1969, triggered the first AI winter.

### [[The Kernel Trick]]

Support Vector Machines solve the linear separability problem through a beautiful mathematical manoeuvre. Take data that's inseparable in low dimensions and project it into higher dimensions where a separating hyperplane exists. The kernel trick lets you compute in this high-dimensional space without ever explicitly transforming the data—keeping calculations tractable.

The optimal hyperplane isn't just any separator. It maximises the margin—the distance between the boundary and the nearest data points (the "support vectors"). This margin maximisation is what makes SVMs generalise well to new data.

### [[Backpropagation and Deep Learning]]

Multi-layer neural networks can learn non-linear boundaries, but training them requires knowing how to adjust millions of weights. Backpropagation, published by Rumelhart, Hinton, and Williams in 1986, solved this by propagating error signals backward through the network, using the chain rule from calculus to compute how each weight contributes to the final error.

This breakthrough ended the first AI winter and set deep learning in motion. The algorithm itself is pure calculus—but applying it to networks with many layers required both mathematical insight and computational power that only became available decades later.
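To make the chain-rule idea concrete, here is a minimal sketch of backpropagation on a toy two-layer network, trained to fit XOR, the classic problem a single hyperplane cannot solve. The layer sizes, activations, loss, and learning rate are illustrative assumptions, not details from the book.

```python
import numpy as np

# Toy dataset: XOR, which no single hyperplane can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Forward pass: compute activations layer by layer.
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule, from the output error back to each weight.
    # With a sigmoid output and cross-entropy loss, the error signal at the
    # output pre-activation is simply (p - y).
    d_out = (p - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_hid = (d_out @ W2.T) * h * (1 - h)             # error propagated to hidden layer
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0, keepdims=True)

    # Gradient descent: nudge every weight against its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 2).ravel())   # predictions typically approach [0, 1, 1, 0]
```

The backward pass is just the chain rule applied layer by layer; stacking more layers only adds more links to the chain.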
---

## Key Insights

**Most machine learning is inherently probabilistic.** Even algorithms not explicitly designed to be probabilistic end up making probabilistic predictions. The outputs are confidence levels, not certainties.

**Bayesian thinking underlies modern ML.** Most of us are intuitive frequentists—we think probability means counting how often things happen. But the Bayesian approach, which updates beliefs based on evidence, is extremely powerful and underlies much of machine learning.

**The AI winters were caused by mathematical limitations.** The first AI winter (late 1960s–70s) happened because single-layer perceptrons provably couldn't solve important problems. The second (late 1980s–90s) happened because the hardware couldn't support the algorithms that worked in theory. Both were eventually overcome.

**Constrained optimisation is everywhere.** Finding the best solution subject to constraints is the core mathematical problem in machine learning. Lagrange multipliers, gradient descent, and related techniques are the tools.

**Computing is state transition.** A computer moves from state to state according to prescribed rules until it reaches an end state. This dynamical systems view of computation connects machine learning to physics and information theory.

---

## Connects To

- [[Everything Is Predictable]] - Bayesian reasoning as the foundation of learning from data
- [[Prediction Machines]] - the economics of what cheap prediction enables
- [[Algorithms to Live By]] - computational thinking applied to human decisions
- [[Why Information Grows]] - information, computation, and physical systems

---

## Final Thought

Machine learning isn't magic. It's mathematics—specifically, the mathematics of finding patterns in high-dimensional space. The perceptron finds linear boundaries. SVMs project data into spaces where linear boundaries exist. Neural networks learn non-linear boundaries through layers of simple operations.

Understanding the maths doesn't make you a practitioner, but it does demystify the field. When someone claims AI "understands" something, you can ask: what hyperplane did it find? What data defined the boundary? What's the margin of error? These are mathematical questions with mathematical answers.

The book traces a sixty-year arc from Rosenblatt's perceptron to modern deep learning. The mathematics evolved, the hardware caught up, and suddenly machines could learn. But the fundamental ideas—probability, optimisation, high-dimensional geometry—were there from the beginning.
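To make the claim that "SVMs project data into spaces where linear boundaries exist" concrete, here is a minimal sketch of the kernel trick itself. The points and the degree-2 polynomial kernel are illustrative choices, not taken from the book: the kernel returns the dot product that an explicit 3-D feature map would give, without ever computing that map.

```python
import numpy as np

# Two arbitrary 2-D points (illustrative values).
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel: 2-D -> 3-D."""
    return np.array([v[0]**2, np.sqrt(2.0) * v[0] * v[1], v[1]**2])

explicit = phi(x) @ phi(z)   # dot product taken in the 3-D feature space
kernel = (x @ z) ** 2        # same number, computed entirely in 2-D

print(explicit, kernel)      # both are 16.0: the projection happened implicitly
```

An SVM never needs `phi` at all; training and prediction use only these kernel values, which is why working in the higher-dimensional space stays computationally cheap.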