In 1997 I was hired by Dresdner Kleinwort Benson to run their Euro, Y2K and Regulatory Minimum Requirements Programmes. These were three classic “mandatory” projects that engendered classic responses: everyone agreed the projects needed to be done, everyone had an opinion on the how and what of each project, but nobody wanted the job of actually getting them done. It turned out fine for me, though. The programmes formed an excellent induction to the bank’s business: its customers, its culture, its processes, its people, its infrastructure and its systems estate. The three years I spent doing all that helped me enormously with my next role there, as CIO.
Not surprisingly, one of the first things we did was to baseline everything, starting with the systems and applications estate. Questionnaires were sent out, validation fieldwork followed, and the database was constructed: every system, with its associated version details, location, architecture, feeds, interfaces, usage characteristics, business sponsors, systems “owners”, the whole nine yards. We had ourselves the basics of a detailed inventory of the systems estate.
I then wanted to apply the Eric Raymond version of Linus’s Law to the inventory, on the basis that “Given enough eyeballs, all bugs are shallow” holds true for “information” bugs as well. Open up the data to everyone in the bank, so that errors and omissions could be identified quickly and efficiently. Simple, no?
Not simple at all.
Someone in information security felt that by making all this data available, I could be providing criminals and terrorists with a detailed set of instructions for a full-throated attack on our estate. It took a few months of robust debate (and a few changes of personnel) to get them to change their minds. What matters is that they did change their minds. The inventory was made available to bank staff, partners, suppliers and contractors; errors and omissions did surface and were dealt with in time. This in turn created a more usable information asset, helped us reduce project risks and costs, and paved the way to our doing everything that needed doing with a minimum of fuss and bother.
I was reminded of this last week when reading Rollie Cole’s excellent post, “Some Observations on the Practice of Open Data as Opposed to its Promise”. If you have the time, do read the whole thing; it’s worth it. In it he analyses the reasons why people hold back from making open data available, the legal and operational issues involved, the heterogeneous nature of the environment at present and the resource implications. He then proceeds to summarise how these issues are being dealt with in pragmatic terms.
I’ve been privileged to know Rollie for a while; we met as readers of Gordon Cook’s excellent Cook Report on Internet Protocol, and belong to a related, and very active, online discussion group. I’ve had a deep interest in open data for a while: during my time as Chief Scientist at BT, I was involved with the Web Science Trust, where Tim Berners-Lee, Wendy Hall, Nigel Shadbolt et al made me an instant convert; I have remained in close touch with the Trust since.
I love numbers. I can stay awake for long periods with my eyes closed working out the Fibonacci sequence from scratch; I can as easily put myself to sleep very quickly counting sheep in Fibonacci. I marvel at the beauty of the primes; I’m mesmerised by “circulating” decimals.
Not everyone is moved by numbers. For many of the people I meet, what they look for are stories, narratives they can follow. So, given the influence of the people at the Web Science Trust, and given the observations made by Rollie, I thought it was time I collected and published stories about open data.
The first one I want to share with you comes courtesy of a friend of mine, Conor Ogle. There I was, gently sipping my green tea while on vacation in Barcelona last week … I nearly choked when I saw what he’d shared on his Facebook timeline: “20,000 pregnant men on the NHS”. I had to know more. So I read the associated article, felt I should delve into more detail, wanted to see the original as published in the British Medical Journal. Didn’t feel like taking a subscription out just for that one article. So I downloaded the BMJ iPad app, waited one week for the “current” issue (April 7th) to pass into “back issue” status, then bought the issue for £2.99 and read the whole thing. Here’s the summary:
In an article headlined The importance of knowing context of hospital episode statistics when reconfiguring the NHS, the authors (Lauren Brennan, Mando Watson, Robert Klaber and Tagore Charles) look at data freely available on HESOnline. I quote from their study:
On average, 1600 adults aged over 30 apparently attend outpatient child and adolescent psychiatry services in England each year. Indeed, the number of adults attending outpatient paediatric services since 2003 has increased steadily, with a steep increase, to nearly 20,000, in 2009-10. Adults over 60 are also being admitted to inpatient child and adolescent psychiatry services.
The rest of the article goes on in similar vein. Here’s another extract:
We were quite surprised to discover that many males seem to be attending outpatient obstetrics, gynaecology and midwifery services. Amazingly, between 2009 and 2010, there were over 17,000 male inpatient admissions to obstetric services and over 8,000 to gynaecology with nearly 20,000 midwife episodes.
They had good reason to be surprised. Apparently the term “midwife episode” is used to describe childbirth. So, according to the data, nearly 20,000 men had babies in 2010.
Now, do I believe that nearly 20,000 men had babies in England in 2010? Of course not.
But do I believe that the data suggests that it happened? Unfortunately, yes.
Data. Erroneous data. Data that could be used to make erroneous decisions. Data that may already have been used to make erroneous decisions.
Data that should not be used to make erroneous decisions.
This is not a new thing. When I was at university, we were told horror stories about a couple of failed hydroelectric dam projects based on bad data related to river depth and speed, data that could have been corrected if it had been inspected and commented upon. Flawed reasoning about security led to the loss of lives and of economic and social well-being.
The authors of the article have performed an important public service in pointing out that there may be just a few, shall we say, “classification” errors in the source data in HESOnline.
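For those who prefer code to stories, here is a minimal sketch of the kind of plausibility check that many eyeballs could run against open data like this. Everything in it is an assumption made for illustration: the `Episode` fields, the service names and the rules (including the under-18 cut-off) are hypothetical and are not the real HESOnline schema or coding rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One illustrative record; these fields are assumptions, not the HES schema."""
    service: str
    sex: str
    age: int


# Hypothetical plausibility rules: each service maps to a condition
# that a correctly classified record would be expected to satisfy.
RULES = {
    "midwifery": lambda e: e.sex == "F",
    "obstetrics": lambda e: e.sex == "F",
    "child_psychiatry": lambda e: e.age < 18,  # assumed cut-off
}


def implausible(episodes):
    """Return the episodes that break the rule for their service."""
    flagged = []
    for e in episodes:
        rule = RULES.get(e.service)
        if rule is not None and not rule(e):
            flagged.append(e)
    return flagged


# Made-up sample records, not real data.
sample = [
    Episode("midwifery", "F", 29),
    Episode("midwifery", "M", 41),
    Episode("child_psychiatry", "F", 63),
    Episode("child_psychiatry", "M", 12),
]

for e in implausible(sample):
    print(e)
```

A flagged record is only a candidate for review. Whether the sex, the age or the service code is the part that is wrong is something a person who knows the context still has to decide.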
This is the 21st century. In years to come we’re going to have to deal with far larger data sets than we’ve ever dealt with before. Some of these data sets will turn out to be critical to our ability to deal with the problems of the 21st century, not just the ones we can see but also the ones we can’t as yet.
Let’s move to a different example, in the context of climate change: The International Centre for Integrated Mountain Development (ICIMOD) has been gathering and making available data on, amongst other things, the status of glaciers in the Hindu Kush-Himalayan Region. Quite important, these glaciers. They’re pretty involved in nature’s cycle of getting drinking water to maybe a third of the world’s population. We just haven’t had the tools or the ability to get detailed information about the glaciers in the past; now we can, and the data sets that are generated are immense. Incidentally, it’s heartening to see the number of countries involved in the study, countries that do not form part of the geographical region. Global issues need global participation.
Having accurate data is very important in our quest to solve many of the larger and newer problems we face. Without it, we get posturing and bigotry rather than reasoned and informed debate. Disease control. Climate change. AIDS. Eradication of poverty. The financial crisis. Education. Government. In every case, there seems to be an inverse relationship between the volume of the arguments and the quality of the data made freely available.
Arthur Wellesley, the first Duke of Wellington, is recorded as having said “Publish and be damned” in response to publisher John Joseph Stockdale’s threat to expose his affair with his mistress. The consequences were not of import to us.
Soon, humanity may be dealing with a variant, with consequences of import.
Publish. Or be damned.
When it comes to the open data movement, those are the choices we face. Publish, so we can correct the errors and make better decisions as a result. Or be damned by poor decisions.
And large quantities of pregnant men.