In 1997 I was hired by Dresdner Kleinwort Benson to run their Euro, Y2K and Regulatory Minimum Requirements Programmes. Three classic “mandatory” projects that engendered classic responses: everyone agreed the projects needed to be done, everyone had an opinion on the how and what of each project, but nobody wanted the job of actually getting them done. It turned out fine for me, though: the programmes formed an excellent induction to the bank’s business: its customers, its culture, its processes, its people, its infrastructure and its systems estate. The three years I spent doing all that helped me enormously with my next role there, as CIO.
Not surprisingly, one of the first things we did was to baseline everything, starting with the systems and applications estate. Questionnaires were sent out, validation fieldwork followed, and the database was constructed: every system, with its version details, location, architecture, feeds, interfaces, usage characteristics, business sponsors, systems “owners”, the whole nine yards. We had ourselves the basics of a detailed inventory of the systems estate.
I then wanted to apply the Eric Raymond version of Linus’s Law to the inventory, on the basis that “Given enough eyeballs, all bugs are shallow” holds true for “information” bugs as well. Open up the data to everyone in the bank, so that errors and omissions could be identified quickly and efficiently. Simple, no?
No.
Not simple at all.
Someone in information security felt that by making all this data available, I could be providing criminals and terrorists with a detailed set of instructions for a full-throated attack on our estate. It took a few months of robust debate (and a few changes of personnel) to get them to change their minds. What matters is that they did change their minds. The inventory was made available to bank staff, partners, suppliers and contractors; errors and omissions did surface and were dealt with in time. This in turn created a more usable information asset, helped us reduce project risks and costs, and paved the way to our doing everything that needed doing with a minimum of fuss and bother.
I was reminded of this last week when reading Rollie Cole’s excellent post Some Observations on the Practice of Open Data as Opposed to its Promise. If you have the time, do read the whole thing, it’s worth it. In it he analyses the reasons why people hold back from making open data available, the legal and operational issues involved, the heterogeneous nature of the environment at present and the resource implications. He then proceeds to summarise how these issues are being dealt with in pragmatic terms.
I’ve been privileged to know Rollie for a while; we met as readers of Gordon Cook’s excellent Cook Report on Internet Protocol, and belong to a related, and very active, online discussion group. I’ve had a deep interest in open data for a while: during my time as Chief Scientist at BT, I was involved with the Web Science Trust, where Tim Berners-Lee, Wendy Hall, Nigel Shadbolt et al made me an instant convert; I have remained in close touch with the Trust since.
I love numbers. I can stay awake for long periods with my eyes closed working out the Fibonacci sequence from scratch; I can as easily put myself to sleep very quickly counting sheep in Fibonacci. I marvel at the beauty of the primes; I’m mesmerised by “circulating” decimals.
Not everyone is moved by numbers. For many of the people I meet, what they look for are stories, narratives they can follow. So, given the influence of the people at the Web Science Trust, and given the observations made by Rollie, I thought it was time I collected and published stories about open data.
The first one I want to share with you comes courtesy of a friend of mine, Conor Ogle. There I was, gently sipping my green tea while on vacation in Barcelona last week … I nearly choked when I saw what he’d shared on his Facebook timeline: 20,000 pregnant men on the NHS. I had to know more. So I read the associated article, felt I should delve into more detail, wanted to see the original as published in the British Medical Journal. Didn’t feel like taking a subscription out just for that one article. So I downloaded the BMJ iPad app, waited one week for the “current” issue (April 7th) to pass into “back issue” status, then bought the issue for £2.99, read the whole thing. Here’s the summary:
In an article headlined The importance of knowing context of hospital episode statistics when reconfiguring the NHS, the authors (Lauren Brennan, Mando Watson, Robert Klaber and Tagore Charles) look at data freely available on HESOnline. I quote from their study:
On average, 1600 adults aged over 30 apparently attend outpatient child and adolescent psychiatry services in England each year. Indeed, the number of adults attending outpatient paediatric services since 2003 has increased steadily, with a steep increase, to nearly 20,000, in 2009-10. Adults over 60 are also being admitted to inpatient child and adolescent psychiatry services.
The rest of the article goes on in similar vein. Here’s another extract:
We were quite surprised to discover that many males seem to be attending outpatient obstetrics, gynaecology and midwifery services. Amazingly, between 2009 and 2010, there were over 17,000 male inpatient admissions to obstetric services and over 8,000 to gynaecology with nearly 20,000 midwife episodes.
They had good reason to be surprised. Apparently the term “midwife episode” is used to describe childbirth. So, according to the data, nearly 20,000 men had babies in 2010.
Now do I believe that nearly 20,000 men had babies in the UK in 2010? Of course not.
But do I believe that the data suggests that it happened? Unfortunately, yes.
Data. Erroneous data. Data that could be used to make erroneous decisions. Data that may already have been used to make erroneous decisions.
Data that should not be used to make erroneous decisions.
This is not a new thing. When I was at university, we were told horror stories about a couple of failed hydroelectric dam projects based on bad data related to river depth and speed, data that could have been corrected if it had been inspected and commented upon. Flawed reasoning about security led to the loss of lives and of economic and social well-being.
The authors of the article have performed an important public service in pointing out that there may be just a few, shall we say “classification” errors in the source data in HESOnline.
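For those who prefer to see it rather than read it, here is a minimal sketch of the kind of cross-field sanity check that would surface such records. Everything in it is an assumption made for illustration: the field names, the specialty labels, the two rules and the five sample records are invented, and none of it reflects the actual HESOnline schema or data.

```python
from collections import Counter

# Hypothetical records: field names and values are invented for
# illustration and are not real HES data or the HES schema.
records = [
    {"id": 1, "sex": "F", "age": 29, "specialty": "obstetrics"},
    {"id": 2, "sex": "M", "age": 34, "specialty": "obstetrics"},
    {"id": 3, "sex": "M", "age": 0, "specialty": "midwifery"},
    {"id": 4, "sex": "F", "age": 45, "specialty": "paediatrics"},
    {"id": 5, "sex": "M", "age": 8, "specialty": "paediatrics"},
]

MATERNITY = {"obstetrics", "gynaecology", "midwifery"}


def male_in_maternity(record):
    """True when a male patient is recorded under a maternity specialty."""
    return record["sex"] == "M" and record["specialty"] in MATERNITY


def adult_in_paediatrics(record):
    """True when an adult is recorded under a paediatric specialty."""
    return record["age"] >= 18 and record["specialty"] == "paediatrics"


# Each rule names a combination that deserves a second look.
# A hit is a prompt for human review, not proof of an error.
RULES = {
    "male in maternity specialty": male_in_maternity,
    "adult in paediatric specialty": adult_in_paediatrics,
}


def flag(rows):
    """Return (record id, rule name) pairs for every rule a record trips."""
    return [
        (row["id"], name)
        for row in rows
        for name, rule in RULES.items()
        if rule(row)
    ]


flags = flag(records)
counts = Counter(name for _, name in flags)

for record_id, name in flags:
    print(f"record {record_id}: {name}")
```

A check like this only raises questions. Someone who knows the context still has to look at each flagged record and decide whether it is a coding error or something with a perfectly good explanation, which is exactly why more eyeballs help.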
This is the 21st century. In years to come we’re going to have to deal with far larger data sets than we’ve ever dealt with before. Some of these data sets will turn out to be critical to our ability to deal with the problems of the 21st century, not just the ones we can see but also the ones we can’t as yet.
Let’s move to a different example, in the context of climate change: The International Centre for Integrated Mountain Development (ICIMOD) has been gathering and making available data on, amongst other things, the status of glaciers in the Hindu Kush-Himalayan Region. Quite important, these glaciers. They’re pretty involved in nature’s cycle of getting drinking water to maybe a third of the world’s population. We just haven’t had the tools or the ability to get detailed information about the glaciers in the past; now we can, and the data sets that are generated are immense. Incidentally, it’s heartening to see the number of countries involved in the study, countries that do not form part of the geographical region. Global issues need global participation.
Having accurate data is very important in our quest to solve many of the larger and newer problems we face. Without it, we get posturing and bigotry rather than reasoned and informed debate. Disease control. Climate change. AIDS. Eradication of poverty. The financial crisis. Education. Government. In every case, there seems to be an inverse relationship between the volume of the arguments and the quality of the data made freely available.
Arthur Wellesley, the first Duke of Wellington, is recorded as having said “Publish and be damned” in response to publisher John Joseph Stockdale’s threat to expose his affair with his mistress. The consequences were not of import to us.
Soon, humanity may be dealing with a variant, with consequences of import.
Publish. Or be damned.
When it comes to the open data movement, those are the choices we face. Publish, so we can correct the errors and make better decisions as a result. Or be damned by poor decisions.
And large quantities of pregnant men.
It’s also a big worry in life science R&D. High-throughput screens (HTS) of hundreds of thousands of sample points show affinity for ligand L against target T, papers are published where controls aren’t fully implemented, or data is harvested selectively, and before you know it, you have a drug failing in Phase III. Seems far-fetched? Consider what’s happened with Sanofi’s iniparib last year. It seems likely that there were data artifacts (and wishful thinking) all the way from hit discovery through to Phase II trials which misled Sanofi as to the actual effectiveness of iniparib, and led to a spurious gating decision when moving the drug to Phase III.
Thank you for this dose of reality in increasingly data-driven internal systems of record and hierarchical decision trees. After all, the more data-driven it becomes, the fewer people need to be involved in the decision. So, it’s time to make our systems and processes more people-centric (aka let people look at the data and the context) instead of assigning the task to a select few who can make sense of it. After all, those few will probably find nothing odd about pregnant men.
JP, spot on, as usual… Have I mentioned Jeff Jonas’s work to you before, or have you already come across him? He is a very interesting guy (and great company). I think you will find some of his thinking of considerable interest…
http://jeffjonas.typepad.com/about.html
Chris Roebuck, HES Programme Manager, HSCIC *does* suggest the “pregnant men” isn’t as bad as painted…
“HES is rich in detail and potentially can be a powerful driver for decision-making, as Dr Brennan rightly points out. Her study demonstrates this in several ways. It is an interesting and correct assertion that thousands of male finished consultant episodes (FCEs) were recorded under the obstetrics, gynaecological and midwifery specialisms. At first glance this would appear out of the ordinary.
However, when we analysed the data further, by age as well as gender, almost all male FCEs related to new-born or very young babies, with 96 per cent relating to babies less than one week old.
We are very keen to support initiatives such as this and always willing to offer our assistance with correct interpretation of data.”
@Daen @Joachim I think we have to keep focusing on machines filtering and humans curating, augmenting the mechanical activity with human passion and intellect. Otherwise there are going to be many and different instances of what Kevin Slavin talked about in terms of algorithms going bad.
Tim O’Reilly introduced me to Jeff Jonas maybe a couple of years ago, but I don’t really know him. Have read his stuff though.
@Steve, not sure about it. It seemed to be pretty bad. GIGO. No real incentives to get it right, poor and misleading classification choices, underskilled data entry, the lot.
@JP Yes, absolutely, sort of connected in a chain, i.e. humans interpret rather than going out and curating on their own (that was KM1.0).
Yet another excellent, thought-provoking post, which also made me smile, possibly for the wrong reasons.
Although I have a slightly different, satirical twist. There are a lot of issues with the carbon-based processing units and the way they use the data; “computer says no” and “it must be right, the computer told me” spring to mind.
Even simple data sets, like stock control, seem to be more than the average carbon-based unit can cope with. Many years ago I went to a wholesaler to buy a piece of hardware. Pointing at a shelf of items, I said “Can I have one of those please?” The member of staff tapped away at the keyboard and said, “We don’t have any in stock”…
@species5618 I agree, when carbon-based units act as servants to silicon-based units, the outcomes are depressing. We have to keep ensuring that it does not happen; and when it happens, we have to keep ensuring that we change it. There are cases where it would appear carbon-based units have forgotten how to think. But that can be amended.