Bad Data Handbook

Name: Bad Data Handbook
Rating: 3.56 (23 reviews)
ISBN: 9781449321888

Q. Ethan McCallum

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems. From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you’ll discover how

Genres: Programming, Computer Science, Nonfiction, Technology, Computers, Business, Reference

262 pages, Paperback

First published January 1, 2012

24 people are currently reading

310 people want to read

About the author

Q. Ethan McCallum

5 books, 4 followers

Community Reviews

5 stars: 23 (19%)
4 stars: 41 (35%)
3 stars: 35 (29%)
2 stars: 15 (12%)
1 star: 3 (2%)

Displaying 1 - 23 of 23 reviews

Bastian Greshake Tzovaras

155 reviews, 89 followers

September 13, 2013

While the title is misleading - this is no handbook in any traditional sense of the word - it was a quick and fun read for anyone at least a bit statistically inclined who has to work with data in a broader sense.

The book is a selection of individual essays that all deal with bad data in one way or another. It's really nice that the contributors all come from different fields and thus bring many different perspectives to the topic. You get the typical big data stories, stock market data, web server logs and even some social science survey material, and in the end the lesson is always more or less the same: don't take your data at face value, do sanity checks to see where the problems are, etc.

If you haven't thought much about bad data before, you can probably at least get an idea of what issues you should keep in mind. If you have already dealt with your fair share of bad data, you may find yourself reading something that resembles a self-help group (in a good way).

For me the most memorable story is how Google News re-indexed a six-year-old story about the bankruptcy of United Airlines due to the lack of proper metadata, which ultimately led to a halt in trading.

(Make sure to bring some rudimentary bash/SQL knowledge to understand some of the chapters.)

non-fiction

Philipp

688 reviews, 222 followers

January 1, 2015

A nice collection of essays on bad data, focusing mostly on bad data in companies and start-ups. I think only one essay handles scientific data. As all chapters are from different authors, there's sometimes a bit of overlap between them, and how much "fun" they are to read varies wildly.

Recommended for: People working with data for their job

programming

Robert Postill

128 reviews, 17 followers

March 26, 2013

When I saw this book I was hooked. As I work in Business Intelligence on a product that weeds out bad data, I figured I was slap bang in the target market. I was right. While the book puts a lot of emphasis on the data scientist role, it has clear overtones of business intelligence, and some of the discussion specifically mentions data warehouses, reporting and the like.

The book is really a collection of essays on a spectrum from non-technical through to technical. Earlier essays are much more practically oriented than later essays. As such it suffers from the classic issues of an essay collection. Namely, there's no one voice in the book (which weakens its overall authority), and the whole is dragged down by its weakest elements. It's fair to say that O'Reilly have really pushed towards the R, Hadoop, big/unstructured data crowd, and that shows in the maturity of the commentary, which is sometimes a little breathless. Having said that, the book is refreshingly clear of the stodginess associated with much of the BI canon.

Some high points of the book for me were the essays "Blood, Sweat and Urine", "Bad Data Lurking in Plain Text" and "Detecting Liars and the Confused in Contradictory Online Reviews". The "Blood, Sweat and Urine" essay in particular was both human and useful. "Bad Data Lurking in Plain Text" mirrored my pain in dealing with plain text and was a very concise overview of a troubling area. Finally, "Detecting Liars and the Confused in Contradictory Online Reviews" was an excellent experience report. An honourable mention should go to "Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough", which was a good way to close the book.

However, the lows were pretty low and knocked at least one star off the review. "Social Media, Erasable Ink?" was a hyperbolic essay whose point seemed overly laboured. Spoiler: your public data stops being yours after you transmit it to the service to publicise it. Also, "Don't Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad" was vague and wishy-washy in the extreme.

Overall, check it out if you work with data; you'll spend a lot of time nodding your head at least. The practical chapters trump the governance chapters in my opinion, but that just reflects the relative maturity of the communities represented in the book. Sadly, though, this book won't be a classic of the field in its current form. I hope O'Reilly knock out a second edition without some of the poorer chapters, as it would be nice to see this book evolve.

bi-im computing got-it

Aaron Schumacher

203 reviews, 11 followers

January 27, 2022

I like this book quite a lot. It's a collection of chapters by different authors, and reads something like a series of excellent blog posts. With the exception of chapter 18, it's quite good. It covers a lot of the issues that arise in practice when gathering and starting to work with data. The explanation of text encoding in chapter 4 could be the best I've seen, and chapter 14 ("myths of cloud computing") is something I wish a lot of people who present themselves as "cloud experts" would read and understand. Philipp K. Janert, author of Data Analysis with Open Source Tools, contributes a very nice chapter as well.

The book closes with a "framework" for data quality, with these "four Cs":

* Complete
* Coherent
* Correct
* aCcountable

It's not bad, this book. I'd recommend it to anyone who needs to work with data in the real world. I think there's room for even more theory and practice of data cleaning; I'd like to see an even better book yet!

have

Venkatesh-Prasad

223 reviews

February 14, 2020

A must-read for any SE (or anybody) who works with data.

Great observations and advice in the era of big data and cloud computing. While the title mentions "bad data", the book is not only about curatively handling bad data. Instead, it is also about preventing the creation of bad data.

For a reader who has been dealing with data, most of the observations and advice in the book should trigger the response "couldn't agree more". If the response is "Really?", then the reader should take a really hard look at how she handles data.

I was really glad to read the chapter on the myths of cloud computing, as it confirmed many of my own views about limitations of cloud computing that I suspect people often don't consider. The book is worth reading for this chapter alone.

Jeroen Delcour

10 reviews

August 16, 2017

A good introduction on how to handle bad data or bad data sources and get an estimate on the viability of your goal given the data quality. Doesn't go into great depths, so nothing more than an introduction, but has some valuable lessons for beginners. It's a collection of mostly anecdotal stories and the lessons learned from them. Some chapters are definitely better than others, so pick and choose which ones seem most relevant to you.

Alena

63 reviews, 7 followers

July 28, 2020

Not a book, but a collection of essays on the topic. The second half was much more interesting to me.

en it

Shawn

175 reviews, 6 followers

November 18, 2017

A great concept for a book. In this day and age, as we seem to be increasingly engaging with things we call datasets, engaging in challenges to make sense of big data and engaging with one another around stuff we call data - here are a series of lessons for dealing with data ... Taking a very case-oriented approach, the collection of articles in this edited volume looks at the problems we run into, either overtly or unknowingly, when working with data. How many have run into the character encoding challenge, received data in a semi-structured form and needed to transform it quickly and efficiently into something more usable, or had to determine a means to identify potential bias or the results of collection errors? Well, that's what the Bad Data Handbook is all about.

Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using up-to-date techniques and methods to tame data and effectively interrogate it for analytical purposes. The premise of this book is data not fit for purpose ... or at least the purpose you might have in mind for it, and in that respect we will call it bad data. The various chapters look at doing 'sniff tests' on the data to see whether it is sound for the purposes you might consider putting it to. How do we find outliers? Can we spot gaps through the use of some handy automated routines? The second chapter looks at techniques for transforming data that was formatted for human consumption into something machine-readable. Subsequently the authors explore ways to consider the data models that have been used to define the collection and processing procedures, which may or may not render data unfit for purpose.

The collection of articles in this book is deadly valuable, and the solutions proposed are code-based. Dealing with the data ultimately involves applying routines to make it suit your needs. The routines are Python-based, so about as approachable as possible for users who may be less familiar with or accustomed to using code to deal with data problems.

I was particularly impressed by the inclusion of a section on working with various text encoding formats and applying techniques to remedy situations that render the data 'bad'. The inclusion of a series of quick exercises in this section is particularly apt.

The general presentation of the book is to identify a specific problem, explain its significance and then to provide hands-on examples of how a user can approach a solution.

The transition to applied techniques that look at data on a broader basis, such as using sentiment analysis and natural language processing to sniff out whether online reviews are genuine, addresses real-world problems with online information more than with data itself.

This is an intriguing book. It looks at the down and dirty manipulation and munging of data, then takes higher-level looks at how we might mistake information for solid data. In all cases it applies good techniques and suggests how one can use sound statistical reasoning, interrogate the data model or delve into code-based manipulation in the pursuit of more truthful data. Due to the broad coverage of this book it is harder to determine who it is directly aimed at. I believe that selective reading of it could inform general practitioners in the digital humanities and in emerging areas of study increasingly engaging with data in new ways. It brings to light many lessons of experience that are simply invaluable and would normally be developed only through hands-on tinkering and discovery, often well into larger projects. It also has broader appeal to data scientists, who benefit for similar reasons, but also from the wealth of hands-on techniques provided that refine and empower standard practice.

In any case I do feel that as a collection of articles it can be a very helpful reference source, with individual sections consulted as needed - by no means is this a linearly designed volume. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.

Jac

482 reviews

December 6, 2012

I really wish I'd picked this up before I started my web scraping project - lots of tidbits that I had figured out for myself the hard way, and some that I hadn't yet been bitten by but when they were brought up I could see where it was going to happen. Definitely glad I read it before starting my second project!

Some of the essays were more immediately useful, some more abstract, but all of them gave me at least one idea that I was glad to have come across. Strongly recommend it for anyone who might accidentally or deliberately end up collecting, storing, or analysing data. I'll probably try and come back to skim it again in six months.

borrowed kindle software

Cliff Chew

121 reviews, 10 followers

May 1, 2016

This book is quite a light read. The book is like a group of people sharing their anecdotal experiences in data management and analytics. You won't get very complex algorithms or analytics from this book. But you will get useful, practical insights about how to manage your data and what to look out for, drawn from various "war stories".

If you want to work in a data-centric start-up, consider reading this book! If you are working in a data-centric start-up now, you will be able to attest to many of the experiences shared, and maybe even learn a few tips on how others mitigate the data issues that you are facing daily at work!

Nick

125 reviews, 9 followers

June 30, 2014

I actually bailed on the last few chapters of this book, having not gotten a whole lot out of the first several chapters. I did learn that 'file' will detect encodings, and that there's a Python one-liner to serve the pwd over HTTP, but most of the rest of the actual information about working with data wasn't really new to me. Ah well. This is not to say it's a bad book, it's just fairly introductory. Also, it's sorta like a conference proceedings, with a bunch of independent chapters, which I don't tend to like.

Jindřich Mynarz

120 reviews, 17 followers

April 27, 2013

The first few chapters are of poor quality, so I recommend skipping them. There are also some good, yet irrelevant chapters, such as the one on cloud computing. However, overall I liked the book, mainly because of chapters such as Pete Warden's text sharing real-world experience with data processing for machine learning, or Richard Cotton's suggestions on automation and validation for dealing with bad data. Don't be discouraged by the start of the book and give it a try.

Helge

10 reviews, 3 followers

June 29, 2013

There are probably few ways of getting an insight into a topic, better than reading up on various perspectives from people dealing with it, each in their own field of work. By pointing their fingers at the critical aspects, the authors will make you think about your own work and about what can be done to improve your approach to data.

big-data programming

Mark

Author of 2 books, 12 followers

February 9, 2013

Nice idea, but a collection of chapters by different authors. I enjoyed some more than others, as you might expect, but hardly a handbook of anything. I wonder if you could write a handbook on this topic.

statistics-etc

Cory

23 reviews, 10 followers

August 22, 2016

This is a good one! Wide range of essays - some more entertaining, some more directly useful - discussing all manner of problems that might be encountered when dealing with data. Well edited and organized.

technical

Tadas Talaikis

Author of 7 books, 78 followers

September 26, 2016

The title should be changed (-1 star for that) - "handbooked" applications (for corrections) should be thought of anyway. As a consequence, too much reading for a few relatively simple concepts (that come from statistics anyway).

Alex Ott

Author of 3 books, 207 followers

July 19, 2014

A set of essays on data cleanup, preprocessing, etc., but many things are too obvious, too much plain common sense.
Maybe I've been working too long with imperfect data...

P.S. Between 2 & 3

big-data ir-dm-nlp-ml-search own-ebook

Steef Jacobson

38 reviews, 16 followers

December 6, 2012

I thought I was the only one with this problem. The book shows great examples and goes through the process to clean up the data so it can be usable.

Ben Sowell

21 reviews

July 11, 2013

Disappointing. This type of technical essay collection doesn't seem to lend itself to much depth.

2013 ipad technical

Honza

6 reviews, 11 followers

October 13, 2013

Not bad, but I was expecting much more.

John Fredrickson

729 reviews, 24 followers

September 23, 2014

Uneven quality between chapters, but much of the book was very good and informative.

computing data-science

Bernardo Botella

7 reviews

August 2, 2015

It's a kind of cookbook showing specific solutions to specific problems. The given examples are not too bad, but it would be nice to have them better described.