Programming Collective Intelligence: Building Smart Web 2.0 Applications

Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it. Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general--all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application. This book
Each chapter includes exercises for extending the algorithms to make them more powerful. Go beyond simple database-backed applications and put the wealth of Internet data to work for you.

"Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details."
-- Dan Russell, Google

"Toby's book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be directly applied to analysis of social interaction across the Web today. If I had this book two years ago, it would have saved precious time going down some fruitless paths."
-- Tim Wolters, CTO, Collective Intellect

360 pages, Paperback

First published January 1, 2002

237 people are currently reading
2184 people want to read

About the author

Toby Segaran

7 books, 8 followers

Ratings & Reviews

Community Reviews

5 stars: 545 (37%)
4 stars: 559 (38%)
3 stars: 282 (19%)
2 stars: 56 (3%)
1 star: 15 (1%)
Displaying 1 - 30 of 96 reviews
Steve
79 reviews, 24 followers
November 7, 2007
This is a beginner's guide to machine learning techniques. In typical O'Reilly fashion, there's very little math but lots of code snippets. While you will learn some motivation for using various techniques, you won't be able to start actively using them with just the overviews in this book.

There's no chapter on Support Vector Machines, just a section on using libsvm, a library that implements SVM. They said an in-depth discussion of SVM was beyond the scope of the book. I strongly disagree; they do discuss Neural Networks somewhat in-depth and, in my experience, NNs have been abandoned for SVMs, so why not seriously discuss SVMs?

I would have liked to see the section on Neural Networks and Genetic Programming removed entirely and chapters on input and output choices added as they are really at the heart of what you do day-to-day with machine learning: for most uses of machine learning, it's 90% data munging, 5% algorithms.

So overall, I think this book is a good starting place for beginners with no experience in machine learning but they should expect to quickly move on to more advanced books in order to start actually using the techniques touched on in this book. Which brings up one last point: for a beginner's book, there was no bibliography. That was very surprising.
Jean-Luc
278 reviews, 35 followers
November 23, 2012
5 years ago, this may have been *the* book for the aspiring Artificial Intelligence practitioner. It hasn't held up as well, or maybe I'm just a lazy whiner, but this book requires far more effort than normal. The libraries the code references have since been updated, and in some cases completely rewritten, so the code samples are sometimes out of date in non-obvious ways. The confirmed and unconfirmed errata must be kept open in your browser at all times.

Chapter 12 (the summary) should be read before reading the rest of the book. Most of it won't make sense on the first viewing, but that's the point: you'll have a better feel for what to pay attention to in the other chapters.

Recommended for anyone who enjoys programming in Python.
Costin Manda
670 reviews, 20 followers
February 28, 2019
Programming Collective Intelligence is easy to read, small but concise, and its only major flaw is the title; and that is because it is misleading. The book touches quite heavily on using collective information and social site APIs, but what it is really about is data mining. It may not be a flaw with the majority of readers, but personally I wouldn't care about the collective, the Facebook API or anything like that, but I was really interested in the different ways to analyse data. In that sense, this book can be taken as a reference guide on data mining.

Each algorithm and idea is accompanied by Python sources. I personally dislike Python as a language, but the author affirms he chose it intentionally because the algorithms look clear and the source is small, with its purpose unhindered by many language artifacts. The book was so interesting, though, that I plan (if I ever find the time :( ) to take all the examples and do them in C#, then place them on Github.

The book covers classification and feature extraction, supervised and unsupervised algorithms, filtering and discovery and it also has exercises at the end of each chapter. Here is a short list:
- Making Recommendations - about the way one can use data from user preferences in order to create recommendations. Distance metrics and finding similar items to the ones we like or people with similar tastes.
- Discovering Groups - about classifying data into different groups. Supervised and unsupervised methods are described, hierarchical clustering, dendrograms, column clustering, K-Means clustering and different methods of visualization.
- Searching and Ranking - it basically explains step by step how to make a search engine. Word frequency, word distance, location of a document, counting methods, artificial neural networks, the Google PageRank algorithm, extraction of information from link text, and learning from user clicks can be found in this chapter.
- Optimization - simulated annealing, hill climbing, and genetic algorithms are described here with examples. The chapter talks about optimizing problems like travel schedules and the example uses data from Kayak.
- Document Filtering - a chapter about filtering documents based on preferences or getting rid of spam. You can find here Bayesian filtering and the Fisher method.
- Decision Trees - a very interesting method of splitting information items into groups that have a hierarchical connection between them. The examples use the Zillow API.
- Building Price Models - k-Nearest neighbors, weighted neighbors, scaling.
- Advanced Classification - Kernel Methods and Support Vector Machines. This is a great chapter and it shows some pretty cool uses of data mining using the Facebook API.
- Finding Independent Features - reviews Bayesian classification and clustering, then proposes Non-Negative Matrix Factorization, a method invented circa the late 90s, a powerful algorithm which uses matrix algebra to find features in a data set.
- Evolving Intelligence - bingo! Genetic Programming made easy. Really cool.
- Algorithm Summary, Third Party Libraries and Mathematical Formulas - if you had any doubts you can use this book as a data mining reference book, the last three chapters eliminate them. An even more concise summary of the methods explained in the book, listing every math formula and obscure library used in the book

Conclusion: I really loved the book and I can hardly wait to take it apart with a computer in hand.
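
The "distance metrics" mentioned under Making Recommendations above can be illustrated with a minimal sketch. This is not taken from the book: the `ratings` data and the function name `euclidean_similarity` are hypothetical, and the score shown (one divided by one plus the Euclidean distance over shared items) is just one common way to turn a distance into a similarity between 0 and 1.

```python
from math import sqrt

# Hypothetical ratings, invented for illustration: reviewer -> {movie: score}
ratings = {
    "Ann": {"Movie A": 4.0, "Movie B": 3.0, "Movie C": 5.0},
    "Bob": {"Movie A": 4.0, "Movie B": 3.0, "Movie C": 5.0},
    "Cat": {"Movie A": 1.0, "Movie B": 5.0},
}


def euclidean_similarity(prefs, a, b):
    """Return a score in (0, 1]; 1.0 means identical ratings on shared items."""
    # Only items rated by both people can be compared.
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0  # nothing in common, so no evidence of similarity
    distance = sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared))
    # Map distance 0 -> similarity 1, and large distances -> values near 0.
    return 1.0 / (1.0 + distance)


ann_bob = euclidean_similarity(ratings, "Ann", "Bob")
ann_cat = euclidean_similarity(ratings, "Ann", "Cat")
```

Ranking every other reviewer by this score, highest first, gives a simple list of "people with similar tastes" of the kind the chapter summary describes.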
Michael
Author of 8 books, 593 followers
February 28, 2008
If it works, it's not AI.

Segaran's book is getting a lot of buzz right now and for good reason. It's a great survey of some common classification and recognition techniques useful for providing critical services associated with "Web 2.0". The explanations and code are easily understood, which says a lot for a book with this subject matter. What is the subject matter? I think it's safe to say that it is not AI but instead statistics. There are certainly touches on classical AI techniques, but they are done in a practical manner (which makes me wonder if they can be classified as AI due to that fact). Long story short, it's a great read, it's presented in such a way that will appeal to hackers, mathematicians, or your run of the mill programmers.
-m
Wael Al-alwani
42 reviews, 15 followers
November 10, 2011
This book was extremely helpful in refreshing my knowledge of many topics I came across in the fields of machine learning, data mining, and optimization. 5 stars to this book for being easy to read and well written, presenting some really sophisticated concepts in a very neat way, and finally putting all these concepts along with interesting ideas and examples all in one place. Some said that many of the techniques explained are no longer very useful given the enormous volumes of data that today's applications deal with. I say this is true, but I think that was outside the scope of the book. Anyway, enjoy reading this book; it is really a great book (I admired the last chapter, #12, which listed all the aforementioned algorithms along with their uses and their pros & cons).
Will Johnson
11 reviews, 22 followers
April 15, 2013
I'm not exactly who this book was for.

The problem was twofold.

1.) There were a lot of errors in the book. O'Reilly's unofficial errata is filled with examples of where the code is incorrect or output in the book doesn't match the actual output you should receive.

2.) The statistical concepts are kind of brushed over. If this book is for programmers wanting to learn about collective intelligence, then it did a poor job in conveying the algorithms. An algorithm / method was introduced without much explanation of how it works so you end up just copying the code without really knowing how you can extend / modify it for your purposes.
Luiza
219 reviews, 5 followers
September 23, 2014
Good practical guide for a first contact with analytics, but it does not go very deep in its explanations. It's much better for coding examples and for seeing results quickly, but most of the time you feel there's something missing from the explanations. The book also needs a good revision, since some of the APIs described are no longer available or have changed in recent years. The links provided by the book are also broken; it would have been much better if the author had used tools like bit.ly for the URLs.
Alex Ott
Author of 3 books, 207 followers
August 15, 2010
Very good introduction to questions related to machine learning, information retrieval & data mining. Could be used to get a high-level overview of the corresponding topics, especially by non-CS people.
Georgi
45 reviews
February 12, 2015
Too much focus on data scraping at the expense of algorithmic/mathematical theory.
Kyle
407 reviews
April 17, 2020
I bought this book a couple of years ago as machine learning and deep neural networks were becoming the big news in the smart algorithms world, and while it was a bit old even then, the concepts have aged well. I had hoped for a bit more answering of why the more complicated algorithms can be expected to work, but this book was not written for that audience. Instead, it explains when the algorithms can be used, and how to implement and use them appropriately. This is an important thing to know, as well, and the book is much more on the applied than theoretical side.

The biggest problem with the book is the incorrect lines of code that pop up every once in a while. This is annoying, and at least one of the errors was incredibly subtle (importing pylab overwrote some of the random functions from the library random). The errata online has some helpful answers, but also is full of things that are not.
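
The pylab problem described above is a case of star-imports shadowing one another. The sketch below only illustrates that general mechanism under stated assumptions: `fake_plotlib` is a hypothetical stand-in module (so the example runs without any plotting library installed), and the two star-imports are simulated by copying public names into a plain dictionary in the same order the imports would run.

```python
import random
import types

# Hypothetical stand-in for a plotting module that exports its own `randint`.
fake_plotlib = types.ModuleType("fake_plotlib")
fake_plotlib.randint = lambda low, high: "not the stdlib randint"

namespace = {}
# Simulates `from random import *` ...
namespace.update({name: getattr(random, name) for name in random.__all__})
# ... followed by `from fake_plotlib import *`, which silently wins on clashes.
namespace.update(
    {k: v for k, v in vars(fake_plotlib).items() if not k.startswith("_")}
)

# The bare name no longer refers to the standard-library function.
shadowed = namespace["randint"] is not random.randint

# Qualified access is unaffected by whatever was star-imported later.
safe_roll = random.randint(1, 6)
```

Importing modules by name and calling `random.randint(...)` explicitly, rather than relying on star-imports, sidesteps this class of bug regardless of which library caused it.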

Given the book's age, it is also written with python2. Python3 is much preferred now, and pretty much all of the libraries used in the book are available for python3 (though they sometimes change names). This requires minor translation efforts. The major problem is that many of the APIs that rely on websites on the internet no longer exist. This means some of the exercises are not possible, or you will have to use some other API. Pandas can make up for some of these deficiencies.

Overall, I found the book enlightening and am glad it forced me to actually write out and get the code working for myself. Experience is a great teacher and if you go through what's given in the book, you'll solidly understand the basics as well as be able to use many algorithms (genetic algorithms, naive Bayesian classifiers, recommendation methods, optimization, some database ideas) that should be useful in commonly faced problems. I did not go through many of the exercises at the end of the chapters, but most of them are straightforward applications or extensions of the ideas in the book, and seem like they would provide a decent challenge.

If you'd like to understand the basics of the above ideas, and have code to use them on data with, then this is a fairly good choice. I would recommend looking around to see if people recommend a newer edition or book with similar ideas, however, since it is a book written over a decade ago. On the other hand, the explanations and figures are helpful and do not over-complicate things.
130 reviews, 2 followers
June 18, 2021
honestly, i skimmed most of this after reading the first couple chapters (see my complaints below). advice: read chapter 12 first and see what interests you and then MAYBE go back and dig into the other chapters.

fair warning: this book is woefully out of date, the code examples are poorly formatted (authors never heard of PEP8 i guess?), difficult to follow (bad variable names, math behind code is not well explained or not explained at all), many of the actual data sources used just simply do not exist and so there are entire sections of the book that are simply useless (unless you want to scrounge & then munge your own dataset to shim it into the format expected so you can fill the gaps).

please note, this text is highly introductory. with that caveat, this book seems like a very nice overview of some basic questions a person might ask about different kinds of data or problems and prototypical solution concepts.
1 review
October 16, 2010
Programming Collective Intelligence (Segaran, 2007) uses a multitude of examples to show how data can be combined and analyzed to produce results that are “more human.” The book intersperses text with Python programming snippets. The programming code allows someone to work through all of the examples discussed in the book. At times, some more advanced examples require additional library downloads, but everything in the book is accessible to the reader.

The book covers a wide range of topics related to data analysis. It begins with a simple algorithm that recommends movies based on your previous movie reviews and the movie reviews of others. Although this was the easiest task within the book, I felt that it was one of the most powerful examples. What is powerful about this chapter is that the mathematics behind the programming was very simple. I think this illustrates the power of the Internet and Web 2.0 systems. Sometimes the analysis of the data is very easy.

I think this chapter on movie recommendations also points to some of the frailties of data mining. The results are only as good as the data that has been collected and analyzed. Thinking of my own personal experiences with movie and music websites that make recommendations, I know that we still have a long way to go to improve the accuracy of these systems. I think the algorithms behind the programming are sound, but I think we are missing some critical components in the collection of the data. There is something very personal about certain datasets that I believe we are still missing. I don’t doubt that we will eventually become more accurate, but I think we still need to find more indicators to include with the datasets.

I also think a powerful statement was made in Chapter 9 when the author stated that, “An important thing to take away from this chapter is that it’s rarely possible to throw a complex dataset at an algorithm and expect it to learn how to classify things accurately. Choosing the right algorithm and preprocessing the data appropriately is often required to get good results” (p. 197). This is a precursor to the chapter related to “Matchmaking” using advanced classification strategies. Throughout the chapter, Segaran talks about the raw data and discusses ways to restructure and normalize the data. I think this is important. For example, converting street address data to discern actual mileage difference between two points, and grouping interests into categories (e.g., snowboarding and skiing). Without this type of preprocessing, comparisons are limited.

Most reviews of this book focus on the fact that it is a balance between programming and mathematical computations. There is a great deal of code on almost every page, but there is little mathematical explanation in terms of formulas. For advanced mathematicians, most of the mathematics used in the book is probably already known, so the formulas may not be needed. In general, I would have liked to see some more theoretical discussions of the topics and perhaps the inclusion of more detailed information related to the mathematical formulas. I believe that this would make the process of applying examples in the book to other datasets a little easier.
Given my minimal programming experience and minimal mathematics experience, I found that the Python code made the book confusing at times. I was not interested in running the programs as I was reading the book, so I found myself trying too hard to decipher the code. If I focused on the text, I was fine. I did find the explanations, tables, and diagrams to be extremely interesting. I have never thought about the process behind search engine rankings, spam filters, or optimization used in recommending the best travel itinerary; however, the book did an excellent job explaining these concepts.

I think that the prospects of connecting datasets to mine data and produce “intelligent” results are particularly powerful. In my profession (K-12 education), I could see these concepts being used to analyze assessment data and make instructional decisions for individual students. I have already seen certain products that attempt this, but I have not seen anything that does a thorough job. Many schools currently assign students to remedial classes or activities to try to increase student performance. If a web application could model using decision tree logic as discussed in Chapter 7, schools could identify the student groups that need particular help in certain areas. I think this type of prescriptive-teacher would be very beneficial. Of course, all of this depends on the accuracy, specificity, and validity of the assessment tools. Education has wrestled with this concept for a long time.

My overall rating of this book is a 4.5 out of 5. Even though some of the concepts and programming were above my head, it caused me to rethink my pedestrian VBA projects and how I could use concepts within this book on future projects. For me, I will need to do much more research to learn and implement these concepts, but I do not think I was the intended audience for this book. I think the best audience for this book would be a programmer that has minimal experience with working with live Web 2.0 data. For a person with preexisting knowledge of programming, and a solid background in some advanced mathematics, I believe that this book would really open the doors to creating interactive websites or applications that use scraped data to enhance an end-user’s experience.
12 reviews, 1 follower
August 2, 2019
I started getting acquainted with machine learning through this book. It covers basic ideas from the ground up and doesn't rely on knowledge of statistics and deeper math. The author does not use the Python ML ecosystem and builds all the algorithms from scratch, which is pretty good if you want to understand the internals of the algorithms. The book covers recommendation systems, classifiers, clustering, and regression models as well as less obvious topics such as searching and ranking, optimization and genetic programming.
3 reviews
May 22, 2017
Great intro to ML/AI algorithms. Having worked through the code, I can tell you it's worth it, but have the errata page on O'Reilly's website handy, as there are often slight mistakes or tweaks. My favourites were chapter 11 on genetic programming and chapter 4 on web search engines, though it is a bit outdated in places.
435 reviews, 18 followers
August 14, 2024
I am definitely not the target of this book. I would say, however, after skimming a bit through it, that it is pretty much obsolete in 2024. Some of the techniques could still be relevant, but as it is supposed to be a hands-on introduction, the code snippets refer to obsolete libraries and frameworks, are often not clear and sometimes not very accurate, with some mistakes.
Mridul Singhai
50 reviews, 12 followers
August 22, 2018
Slightly outdated for today's times, but still does a good job at describing the practical techniques required to make small features for a pet web application without all the morbidity that surrounds today's age of statistical inference.
Bigpapa44
13 reviews
July 2, 2017
It's a good book, though it has some little mistakes.
I'll read it again to fully understand it.
24 reviews, 10 followers
May 7, 2019
Amazing read, very captivating. Nice learning curve, no sudden unexplained jumps.
Cameron
76 reviews, 6 followers
July 26, 2020
pretty good intro I learned some stuff
8 reviews
March 16, 2017
If you're looking for a great starting point for learning about machine learning & data analytics, this is it. Toby Segaran does an excellent job of explaining the concepts behind collective intelligence, then walks you through the process of writing code to capture and analyze big data sets. I've read this book multiple times and still refer back to it. Highly recommended.
Thomas
Author of 1 book, 57 followers
August 21, 2021
After the author's recent passing, I dug up this review I wrote on Amazon, but apparently never shared here. The original date of this review was September 7, 2007. Sidenote: at the time of this review, I was quite the Ruby fan and didn't care much for Python. Somehow, in the years since, that attitude has done a complete reversal.


I first learned of this book just a few weeks ago, shortly before it was available. I immediately read the sample chapter on the publisher's website and was certain I had to get a hold of a copy.

I was not in the least bit disappointed with what I found. It has been quite a while since I've looked at any Python code (I'm more of a Ruby fan, personally), but the code is easy to follow and it's a simple matter to extract the basic concepts into any language.

I have spent quite a few years now watching the field of machine intelligence from the sidelines, occasionally reading the odd technical write up or wikipedia article, trying to wrap my brain around the basic ideas. The thing is, it's not clear to me that in some regards, it's not that complex. It's just that most of the existing books and articles are written for those immersed in the field. This book is not like that. It explains things in clear language that is easy to follow, using simplified examples and making excellent use of graphics to "show" you how it works.

If you really want to dig in deep, Segaran provides exercises at the end of each chapter and gives you an appendix full of mathematical formulas (the "pure" representation of the algorithms).

Finally, I should mention that the last chapter does what so many other technical books should but don't: it clearly summarizes everything he has shown you. He does this in a straightforward way so that you won't have to go searching through the book, rereading everything again, to put these techniques into practice.
Wai Yip Tung
31 reviews, 15 followers
January 1, 2013
When this book first came out in 2007, it generated quite a thrill. For many programmers like me, it opened a door to the world of machine learning. The book introduces a range of machine learning algorithms for solving problems such as classification, clustering and optimization by learning from data and making statistical inferences. There is little theory or mathematics used. Instead the emphasis is on program code. It does come up with simple but practical data sets so that the algorithms make intuitive sense.

I have taken a circuitous course with this book. After I ordered it when it first came out, I strove to work through all the code examples instead of reading passively as I would do with a normal book. While it is an excellent idea to be as hands-on as possible, it also turned the reading into a project. Unfortunately I was doomed, because it became one of the many projects I always want to do but mostly shelve. It was not until 2012 that I finally finished reading the whole book.

In the meantime, I have learned a lot of the machine learning topics separately, often with reference to much more theoretical materials. When I came back to Programming Collective Intelligence, I no longer found it as exciting as it once was. Reading source code is actually not a great way to learn the underlying mathematical model. And without the mathematical model, one cannot fully appreciate the insight from the method.

That said, this book distinguishes itself from more formal works in a very important way: it has running code. Not only does the code work on some toy models, it is also meant to show you how to pull data from available web services or by screen scraping. This is a great complement to a theoretical book because it gives you an experiment kit that you can try out and tinker with so easily.
Guilherme d'
10 reviews, 7 followers
February 9, 2017
The book is quite good for getting a general idea of how some common algorithms work, and does that in a very nice way. The biggest problem for me, which is not a fault in the book, is that most of the materials that the book uses to teach the algorithms are not available anymore or are outdated, so in the end I ended up only reading the book, without applying the code. Maybe the newest edition doesn't have this problem.
18 reviews, 2 followers
June 25, 2008
This book is a survey of machine learning algorithms useful for tasks like spam filters and recommendation engines. It's a great book if you're a practicing programmer who wants to get things done, less great if you're looking for a deep exploration of a particular topic.

There are a few things I liked about it. The most important feature of the book is its breadth. It covers a variety of useful algorithms, from more well-known techniques (Bayesian filters) to recent developments (support vector machines).

Another thing I liked about this book was the example problems all use real-world datasets. RSS feeds, the Facebook API, and live flight searches are all used in various places.

One excellent feature of the book is the example code builds at each step. Generic algorithms written in earlier chapters are reused or extended in later chapters. In addition to being good software design, this lets the reader see how a concept (e.g. optimization) can be applied in different contexts.

The book makes liberal use of the interactive interpreter to explore the algorithms, and the code itself is fairly readable (though not fantastic).
Tom
88 reviews, 11 followers
October 18, 2009
This is an incredibly useful book for all those who are looking to divine intelligence with data collected through their web apps. Segaran mixes equal parts math, theory and practice in a way that keeps the reader's attention while introducing a number of somewhat complex machine learning topics. Python was a wise choice for the example programs as well.

This book does not need to be read in order. In fact, my humble recommendation is to read the introduction in Chapter 1, then skip to Chapter 12, Algorithm Summary. IMO Chapter 12 is the gem of the book--it does an excellent job summarizing the supervised and unsupervised learning techniques discussed in Chapters 2-11, as well as showing the strengths and weaknesses of each approach. After finishing Chapter 12, I would read Chapter 2, since the distance calculation methods are used throughout the book, and after that, any other chapters that pique your interest.

But whatever you do, READ THIS BOOK! If you are a software developer working on the web, and you have mastered the basics of web programming, this is a must-read.
Noah Sussman
12 reviews, 6 followers
June 22, 2010
I lost my copy of this book, which is too bad. This book was one of my first books about application building, as opposed to User Interface or general Computer Science. There's a lot more math than I'm used to -- every example so far contains a mathematical function.

So far I've seen how to calculate movie recommendations from a list of critics' ratings. This is something I never thought about doing before, and it's surprisingly easy -- basically comes down to plotting the different ratings as points on a graph.

One thing I'm already appreciating is the examples of how to manipulate "dictionaries." A dictionary is a Python data structure (all the example code is Python) similar to a hash. It's also similar to the hash-like JSON objects that I commonly use when I write JavaScript UI widgets. Segaran demonstrates how to build dictionaries, and then how to transform them in order to get a different perspective on the result set.
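
The dictionary transformation described above can be sketched in a few lines. This is an illustrative assumption rather than the book's own listing: the function name `transform_prefs` and the sample `critics` data are hypothetical, and the idea shown is simply flipping a person-to-items mapping into an item-to-people mapping.

```python
def transform_prefs(prefs):
    """Flip {person: {item: score}} into {item: {person: score}}."""
    flipped = {}
    for person, scores in prefs.items():
        for item, score in scores.items():
            # setdefault creates the inner dictionary the first time an item is seen.
            flipped.setdefault(item, {})[person] = score
    return flipped


# Hypothetical ratings, invented for illustration.
critics = {
    "Ann": {"Movie A": 4.0},
    "Bob": {"Movie A": 2.0, "Movie B": 5.0},
}
by_movie = transform_prefs(critics)
```

With the data keyed by movie instead of by critic, the same comparison code that finds similar people can be reused to find similar items, which is the "different perspective on the result set" the review mentions.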
Josep
24 reviews
June 18, 2012
This book does a good job of introducing machine learning technologies to the average programmer. This is its main merit. Having said that, the introduction to the subjects is very simplified, so you'll need further references to actually implement anything at all. It's full of Python code snippets that only work to make the subject appear accessible to the programmer, and look like waffle to me. The mathematical formulas on which the code snippets are based can only be found (without further explanation) in an appendix. Algorithm alternatives or optimizations for real-life operations are not described. But in the end, this is the objective of this book. My main criticism would be that the book doesn't fully succeed at explaining exactly what you should use each technique for and what their pros and cons are.

To sum up, read this book if you are a programmer, you have no previous knowledge of machine learning topics and you want a math-free introduction to the subject.