Friday, May 27, 2011

Empirics is naked without theory

Yesterday we saw that if you make your way through Wikipedia by repeatedly clicking the first link of the current article that is neither italicized nor in parentheses, you will probably end up stuck in this cycle:
{philosophy, existence, senses, physiological, science, knowledge, facts, information, sequence, mathematics, quantity, property, modern philosophy, philosophy}
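
For the curious, here is a minimal sketch of that first-link walk in Python. It assumes the links have already been extracted into a dictionary mapping each page to its first link; the names `find_cycle` and `first_link`, and the "economics" starting page, are made up for illustration and are not taken from Wikipedia or from anyone's actual code.

```python
def find_cycle(first_link, start):
    """Follow first links from `start` and return the cycle eventually entered."""
    position = {}   # page -> index at which it was first visited
    path = []
    page = start
    while page in first_link and page not in position:
        position[page] = len(path)
        path.append(page)
        page = first_link[page]
    if page not in position:
        return []   # dead end: reached a page with no recorded first link
    return path[position[page]:]


# Toy link map built from the cycle listed above.
cycle_pages = ["philosophy", "existence", "senses", "physiological", "science",
               "knowledge", "facts", "information", "sequence", "mathematics",
               "quantity", "property", "modern philosophy"]
first_link = dict(zip(cycle_pages, cycle_pages[1:] + cycle_pages[:1]))

# Hypothetical starting page, assumed for illustration only.
first_link["economics"] = "science"

cycle = find_cycle(first_link, "economics")
print(len(cycle), "pages in the cycle, entered at", cycle[0])
```

The walk stops the first time it revisits a page, and everything from that page onward is the cycle. The starting page itself is not part of the result unless it happens to sit on the cycle.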

Today Kevin Stock provides some stats on Wikipedia links:

I wanted to know exactly how many articles would lead to Philosophy. Thanks to Wikipedia for providing complete archives I was able to parse through the latest complete dump which just finished this morning and generate stats on where most pages eventually lead, which is here as the percentage of pages which reach the given page (note that the pages tied at 93.39% form a loop):

93.39% Sense
93.39% Philosophy
93.39% Perception
93.39% Panpsychism
93.39% Mind
93.39% Existence
93.39% Consciousness
93.39% Awareness
89.60% Modern philosophy
89.60% Property (philosophy)
89.17% Quantity
89.15% Mathematics
84.78% Sequence
84.78% Information
83.72% Fact
83.72% Knowledge
78.66% Science
62.09% Natural science
32.73% Physics
28.74% Biology
23.98% Extant taxon
23.97% Human
19.05% Observable
19.05% Event
19.05% Causality
19.05% Interaction
19.05% Community
18.93% Academia
18.82% List of academic disciplines
17.72% Social sciences
15.18% State (polity)
10.08% Language

Interesting, Kevin!

By the way, these percentages would seem to be a purely empirical question. But here we see the value of theory vis-a-vis empirics, because we have a solid (in fact, ironclad) theory: any article that reaches one page in a loop also reaches every other page in that loop, so every article in the same loop as Philosophy must be arrived at from the same percentage of articles. Yet he estimates that Mathematics and Philosophy, for example, are not arrived at from the same percentage of articles (89.15% versus 93.39%). So (unless the link structure changed between the time of my post and his data), something is slightly wrong with his code.

We want our empirical estimates to satisfy the restrictions imposed by the theory. And when they don't, we know something's wrong in the data or code. And then we know to look harder at our data and code. What's more, the particular nature of the violation can give us a good idea of where to look.
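
To make that concrete, here is a rough sketch of such a check in Python. It is not Kevin's code (which I haven't seen); the toy link map and the names `reach_counts` and `cycle_violations` are invented for illustration. It counts, for each page, how many pages eventually lead to it (counting a page as reaching itself), and then flags any member of a loop whose count disagrees with the rest.

```python
def reach_counts(first_link):
    """For each page, count the pages whose first-link walk passes through it."""
    counts = {}
    for start in first_link:
        seen = set()
        page = start
        while page is not None and page not in seen:
            seen.add(page)
            page = first_link.get(page)   # None when a page has no recorded link
        for visited in seen:
            counts[visited] = counts.get(visited, 0) + 1
    return counts


def cycle_violations(cycle, counts):
    """Return the loop members whose count differs from the first member's."""
    expected = counts[cycle[0]]
    return {page: counts[page] for page in cycle if counts[page] != expected}


# Toy link map (hypothetical): a -> b -> c -> a is the loop; d, e, f feed into it.
toy_links = {"a": "b", "b": "c", "c": "a", "d": "b", "e": "d", "f": "c"}
loop = ["a", "b", "c"]

counts = reach_counts(toy_links)
ok = cycle_violations(loop, counts)       # empty: the theory's restriction holds

# Simulate a bug that undercounts one loop member.
buggy = dict(counts)
buggy["c"] -= 1
bad = cycle_violations(loop, buggy)       # flags "c"
print(ok, bad)
```

This brute-force walk from every page is fine for a toy example but would be slow on the full dump. The point is only the check at the end: if loop members come out with different counts, look at the code before trusting the numbers.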

(Of course more generally there is a give and take between theory and empirics. In economics the theories are never so ironclad...)

Q-tip: Paul Ho!


  1. Very interesting.

    I tried it out a few times and I did indeed keep ending up in the cycle. I also played around with not picking the first word but picking the first noun, or even picking a semi-random link, and I still eventually ended up in the cycle.

  2. I just tried picking the second link repeatedly, and a couple times I actually ended up in the loop but then escaped it and got stuck in a dinky side loop. I think the second link rule may be particularly pathological though.