Thinking about web statistics

This post was triggered in part by a story in today’s Wall Street Journal, where an apparently biased survey suggested that millions of teenagers were buying alcohol online in the US.

The phrase “Lies, damned lies, and statistics” appears to be well over a century old, so I’m not treading new ground here; I think it’s more like combing through fossil remains. There are many attributions; the earliest published reference is in the late nineteenth century, and received wisdom suggests that the originator of the phrase was Disraeli, while it was given currency by Twain.

It is instructive to watch what’s happening with web statistics; the more I look at it, the more I feel that we need new ways of measuring what happens on the web.

I think the web is disaggregating something that has always been a centralised process of collecting relevant numbers; I think the web is often disintermediating the central specialised body that produces the numbers as well. This makes a lot of people uncomfortable, since they lose the ability to preview and massage the numbers. And I think something else is going on; Chris Anderson’s Long Tail is not something that sits well with central-minded people either.

Let’s take a few examples:

Some months ago, RageBoy was commenting on how the geographical distribution of his readership seemed to vary according to the tool used, with some tools showing hits in the most outrageous places. And remember this is RageBoy I’m talking about, so I mean outrageous when I say outrageous. [Couldn’t find the original post. Chris?] I have seen some evidence to support this, and all I can say is that we’re not very good at this right now. Sure, new and better tools are coming along, but my perception is that Web 1.0 does not make this easy; the infrastructure can be gamed by accident or design.

Maybe a week ago, Miss Rogue commented that it was all a farce anyway, talking about how ranking and hits were being gamed; when she comes back from the film, she’ll probably say these numbers are like Counting Snakes On a Plane :-)

And then today Doc Searls gave a Harry Frankfurt response to some noise on the web to do with A-list bloggers and traffic and hits and search engine optimisation and all that jazz. I quote:

Nick also says,
As the blogosphere has become more rigidly hierarchical, not by design but as a natural consequence of hyperlinking patterns, filtering algorithms, aggregation engines, and subscription and syndication technologies, not to mention human nature, it has turned into a grand system of patronage operated – with the best of intentions, mind you – by a tiny, self-perpetuating elite. A blog-peasant, one of the Great Unread, comes to the wall of the castle to offer a tribute to a royal, and the royal drops a couple of coins of attention into the peasant’s little purse. The peasant is happy, and the royal’s hold over his position in the castle is a little bit stronger.
Bullshit.
Want to succeed in the blogosphere, or the Web in general? Easy. Do search engine optimization. Here’s how:

  1. Write quotable stuff about a lot of different subjects.
  2. Do it consistently, for months if not years.
  3. Link a lot, as a way of giving credit and of sending readers to other sources of whatever it is you write about.
That’s it.
I can’t promise royalty, because there isn’t any. But I can promise a rewarding relationship with the readers you’ll get, regardless of how many there are.

Wonderful stuff.

Update. Here’s Hugh on the Carr piece:

  • There are basically two rules of blogging:
  • 1. Nobody is going to read your blog unless there’s something in it for them.
  • 2. Nobody is going to link to your blog unless there’s something in it for them.
  • These two rules apply to us all, A-List and Z-List alike. If you don’t like these rules, you’re better off finding an ecology whose rules you like better. Life is short.

These are serious issues; I will come to the reasons shortly.

But in the meantime. Maybe many of us know that the numbers are not that reliable, but maybe many of us don’t care too much about it. I look at my Technorati rank, sure. And I learn something about how it works and what it means. And yes I get a kick out of being in the top 10K, but not that big a kick. Because I don’t blog for my Technorati ranking. What I really use Technorati for is first and foremost to find stuff in the blogosphere. And then maybe learn a little about the wisdom as well as the madness of crowds, by looking at what appears to be popular, but only at the tag level. It is rare that I delve deeper. And I also use Technorati to find out who’s linking to me. If markets are conversations (which they are) and blogs are the opensourcing of ideas (which they are) then it seems to make sense to find out just who you’re talking to. Relationships not transactions. Covenant not contract.

Now to the meat. And why I wrote this post.
Traditional thinking, pre-web, pre-Long-Tail, liked to use surveys and sampling techniques and normal distributions and a bunch of other stuff in order to define something they called audiences and traffic. Traffic they liked to measure in order to figure out something called hits. Hits that denoted their incredible ability to market something called content.

Sampling. Traffic. Hits. Content. Stanchions of the past. Pillars that are the lychgate to the churchyard of an obsolescing age.

They just don’t get micromarkets and microconversations and non-broadcast-mode and non-centrally administered and not content and not audience and not hits.
An age that is not yet obsolete. An age where attempts will be made to maintain, even strengthen, these stanchions.

And how will this strengthening take place?

With numbers. Numbers that you and I know are, shall we say, weak. Numbers that will nevertheless be used to “educate” people, particularly those that create legislative support. Which translates into lock-ins and protection and “advertising” and annuity revenue streams and all that jazz.
So next time you see numbers that tell you just how many gazillion illegal downloads happened while you read this sentence, how many gazillion dollars it will take to provide the infrastructure for all this live TV that is clogging up the tubes and slowing down someone’s internet, how many multigazillion illegal copies of software already exist on consumer desktops, or for that matter how much of the internet is dedicated to filesharing, next time you see all this, don’t be surprised. Don’t ask “How can it be?”. You have to be able to measure the problem in order to get the protection.
When you see bloggers being called A-List against their wishes, don’t ask “How can this be?”. You have to have hits.

When you see web sites with unbelievable links and hits and distribution, don’t ask “How can this be?”. You have to have audiences and traffic.
But don’t worry. The Emperor Has No Clothes On. People will get wise to this. As the numbers get better. Which they will.

2 thoughts on “Thinking about web statistics”

  1. At present, there are more new accounts created per day on MySpace than there are ‘hits’ on most sites. This strongly affects the relative ranking of sites outside the top 10.

    I.e. The web is growing very fast and the volatility of the time series data associated with this is high, relative to the absolute size of most reported time series.

    Therefore, almost all statistics that you read about are useless BS because they are calculated and quoted out of context.

    Attribution – this is not my own thought, but due to a commenter in the Q&A session after this talk at OSCON:
    http://conferences.oreillynet.com/cs/os2006/view/e_sess/8751

    Relative trend data comparing sites whose traffic is on a similar order of magnitude, averaged over long time periods, is useful. Reported revenues can also have merit, but companies can lie about those from time to time.

    So, yes, I agree the Numbers Will Get Better :-)

  2. Well put, J.P.

    My take at the moment is that we’re .X nanoseconds into the Big Bangs of many polyverses, and we’re all just guessing about what’s happening and where it’s going. Galaxies will form, yes; but beyond that all bets are long. Somehow, when you add numbers to those bets, the odds get even longer. There’s a principle in here somewhere, and I think you’re hip to it.

    Meanwhile, it rocks to surf the bow wave of a ship too big to measure.

    Doc
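
A rough illustration of the first comment's point about volatility and long averaging windows, for anyone who wants to play with it. This is only a sketch under made-up assumptions: the two sites, their mean daily hits (1,000 and 1,200), the uniform noise of plus or minus 400 and the 90-day window are all hypothetical, and the function names are mine, not anything from a real analytics tool. It counts how often the "leader" between two similar-sized sites changes when you compare raw daily numbers, and again when you compare long rolling averages.

```python
import random
import statistics


def simulate_daily_hits(mean_hits, noise, days, rng):
    """Hypothetical model: each day's hits are the mean plus uniform noise."""
    return [mean_hits + rng.uniform(-noise, noise) for _ in range(days)]


def rank_flips(a, b):
    """Count how many times the leading site changes from one point to the next."""
    leaders = [x > y for x, y in zip(a, b)]
    return sum(1 for prev, cur in zip(leaders, leaders[1:]) if prev != cur)


def rolling_mean(series, window):
    """Average over each run of `window` consecutive points."""
    return [
        statistics.fmean(series[i - window:i])
        for i in range(window, len(series) + 1)
    ]


rng = random.Random(2006)  # fixed seed so the run is repeatable

# Two made-up sites of similar size; the noise is large relative to the gap.
site_a = simulate_daily_hits(1_000, 400, 365, rng)
site_b = simulate_daily_hits(1_200, 400, 365, rng)

daily_flips = rank_flips(site_a, site_b)
smoothed_flips = rank_flips(rolling_mean(site_a, 90), rolling_mean(site_b, 90))

print("leader changes, raw daily numbers:", daily_flips)
print("leader changes, 90-day averages:  ", smoothed_flips)
```

With noise this large relative to the gap between the sites, the daily ranking swaps back and forth constantly, while the long-window comparison is far steadier. That is the sense in which a single quoted number says very little and a relative trend over a long period says rather more.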
