In the recently published paper “A Survey of Prediction Using Social Media” (by Sheng Yu and Subhash Kak) the authors establish three basic requirements that have to be fulfilled for it to be at all meaningful to examine social media for predictions. The requirements are interesting and open new questions for researchers.
- First, the prediction must be of a “human related event”.
- Second, the set of aggregated views has to mirror the real world.
- Third, the involved events should be easy to discuss in public.
I think the notion of a human-related event is interesting in itself. Predicting the weather from social media would then be impossible, but is it really? We can imagine that people mention the weather often enough for us to a) estimate the present state in any specific place and, given what we know about weather systems and weather prediction generally, b) at least try to beat the baseline hypothesis of “the weather tomorrow will be like today”. The authors claim that it would be possible to detect discussions about an eclipse, but not to predict it, from social media. I am far from sure that is true. The distinction between human-influenced events and human-independent events is relevant to many models, but if we treat social media simply as a set of data that can be correlated with different events, then we should not overemphasize the human element.
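To make “beating the baseline” concrete, here is a minimal sketch of how one might score a forecast against that persistence hypothesis. The temperatures and the `social_forecast` values are made up for illustration, and the function names are my own; nothing here comes from the paper.

```python
from typing import Sequence


def persistence_forecast(observed: Sequence[float]) -> list[float]:
    """Baseline: the forecast for day t+1 is simply the observation on day t."""
    return list(observed[:-1])


def mean_absolute_error(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between forecast and actual values."""
    if len(forecast) != len(actual) or len(actual) == 0:
        raise ValueError("forecast and actual must be non-empty and equally long")
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def skill_vs_persistence(candidate: Sequence[float], observed: Sequence[float]) -> float:
    """Skill score relative to persistence.

    Positive values mean the candidate beats the baseline, 1.0 is a perfect
    forecast, and zero or below means it does no better than persistence.
    """
    actual = list(observed[1:])  # days 2..n are the ones being forecast
    baseline_error = mean_absolute_error(persistence_forecast(observed), actual)
    candidate_error = mean_absolute_error(candidate, actual)
    if baseline_error == 0:
        raise ValueError("persistence is already perfect on this series")
    return 1.0 - candidate_error / baseline_error


# Made-up daily temperatures (illustrative only).
observed = [10.0, 12.0, 11.0, 15.0, 14.0]
# Hypothetical forecasts for days 2..5 derived from social media mentions.
social_forecast = [11.0, 11.0, 14.0, 14.0]

print(skill_vs_persistence(social_forecast, observed))
```

Whether a real social media signal would ever score above zero here is exactly the open question; the sketch only shows what the comparison would look like.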
The second requirement seems to make sense, with the caveat that the “real world” can be quite small. It would for example be possible to imagine a world in which we can predict who gets the Oscar from a set of bloggers and Twitter accounts that are not representative of the general demographic in a country. There needs to be some kind of correlation between the social sources and the group whose actions we are trying to predict, but I am not sure that the sources need to have a similar or identical composition, as the authors require.
The third requirement is that the issues must be talked about in public. They must be open, in a sense. That depends on what data sets we are discussing, and what we term social media, of course. But aggregations of search results do not necessarily qualify, I assume. And what does public mean? I get that the accessibility of the data sets matters, but is a data set being accessible the same as its subject being spoken about in public? That seems like an unnecessarily strong requirement. Is tweeting something the same as talking about it in public, for example?
These are but small notes in the margin. The paper very helpfully presents a great survey of the work on social media prediction. It ends on a positive note:
[social media prediction] has created a new way for us to collect, extract and utilize the wisdom of crowds in an objective manner with low cost and high efficiency.
The really interesting perspectives open up when you think about what this will look like in ten years' time, with the troves of data that we will then have access to as a society. I wrote an earlier note about psychohistory that I think opens up a few questions about that: