Analysing data for security: Five misconceptions…

1. What is transatlantic data sharing?
After the events of 11 September 2001, much of the emphasis on possible ‘preventability’ was placed on the problems of ‘stovepiped data’ – data held in separate databases that could not be integrated or cross-searched. A perusal of all of the 9/11 Commission ‘updates’ reveals that the mining and analysis of ‘terrorism related’ data has been the mainstay of the oft-cited ‘more imaginative’ approaches to intelligence http://www.dhs.gov/xlibrary/assets/implementing-9-11-commission-report-progress-2011.pdf Following the transatlantic bomb plot of 2006 http://www.guardian.co.uk/uk/2009/sep/07/transatlantic-airline-bomb-plot-timeline the US authorities exerted renewed pressure on the UK government for access to commercial data that is thought to reveal ‘patterns of note or interest’. Consider, for example, the centrality of travel patterns to the subsequent trial and conviction of the transatlantic bomb plot group – by 2007, the US Secretary of Homeland Security, Michael Chertoff, was visiting the UK and the European Parliament to make the case for US access to European passenger name record (PNR) data http://useu.usmission.gov/may1407_chertoff_ep.html In some ways PNR data is not dissimilar to the kinds of social network media data that have now emerged as part of analytics-based programmes. PNR data contains both structured and unstructured data, and it is mined as much for what it says about people’s associations, links and networks (tickets bought on the same credit card, shared telephone contact details, seat booking details, etc.) as for what is more often thought of as ‘content’.
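To make the idea of mining for associations concrete, here is a minimal sketch of how shared booking details can link otherwise separate passengers. The records, field names and the function `shared_attribute_links` are all invented for illustration; they are assumptions, not a real PNR schema or any agency's actual method.

```python
from collections import defaultdict
from itertools import combinations

# Invented booking records; the field names are assumptions, not a real PNR schema.
bookings = [
    {"passenger": "A", "card": "card-1", "phone": "555-0101"},
    {"passenger": "B", "card": "card-1", "phone": "555-0102"},
    {"passenger": "C", "card": "card-2", "phone": "555-0102"},
    {"passenger": "D", "card": "card-3", "phone": "555-0199"},
]

def shared_attribute_links(records, fields=("card", "phone")):
    """Return {(passenger_a, passenger_b): [shared fields]} for each pair sharing a value."""
    # Group passengers by each (field, value) they appear with.
    by_value = defaultdict(set)
    for rec in records:
        for field in fields:
            by_value[(field, rec[field])].add(rec["passenger"])
    # Every pair inside a group is linked by that field.
    links = defaultdict(list)
    for (field, _value), passengers in by_value.items():
        for a, b in combinations(sorted(passengers), 2):
            links[(a, b)].append(field)
    return dict(links)

links = shared_attribute_links(bookings)
for pair, shared in sorted(links.items()):
    print(pair, shared)
```

In this toy data, A and B are linked by a payment card and B and C by a telephone number, so A and C become associated through B although they share nothing directly. No ‘content’ is read at any point; the links themselves are the output.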

2. So is it secret? Did we already know?
It does seem to be the particular form of the data that has been at the heart of the controversy. Data on our online habits and proclivities seem to be particularly intimate or private. In fact, though, the mining and analysis of what we think of as personal data has a much longer history. In June 2006, the New York Times revealed the operation of the Terrorist Finance Tracking Program (TFTP), in which the US Treasury requested and analysed “blackboxed” data on financial transactions from the Belgium-based SWIFT. In effect, the access to SWIFT data enabled the US authorities to analyse credit card and wire transfer transactions from across the whole of Europe. Together with Marieke de Goede and Mara Wesseling, I write about it here http://www.tandfonline.com/doi/abs/10.1080/17530350.2012.640554?journalCode=rjce20#.UbWlq8qbjt0 The analysis of financial transactions is not surveillance of economic life so much as the targeting of ways of life that are thought to be revealed by the transactions (e.g. wire transfers to specific parts of the world).
So, along with the analysis of PNR data (1 above), it is public knowledge that the mining and analysis of large volumes of structured and unstructured commercial data, provided by private companies, has been taking place since at least 2006. My project with Marieke de Goede, ‘Data Wars’, has assessed some of these programmes since 2008 http://www.esrc.ac.uk/my-esrc/grants/RES-062-23-0594/outputs/read/eb259903-1142-4a29-9728-46e5947c8530
What is also known publicly is that both European and US security authorities have extended what counts as ‘intelligence’ material to include the unstructured data of internet search engines and social network media. It should be remembered that most of this is open source material. In 2011, the US Director of National Intelligence, James Clapper, publicly spoke of the value of fragments of data that can be associated together with other elements to build an intelligence picture:
“To develop security programs that take full advantage of the fragmentary intelligence information we need something else. For too long the only responses to the incomplete threat information we collected on Al Qaeda operatives was a general colour coded terrorism warning or the no-fly list. We needed to do better. I engaged personally with Secretary Napolitano on this issue early in 2010, and DHS developed several imaginative programs to take advantage of partial intelligence to guide the screening at border entry points.”
Now, two years later, in effect he is disavowing these very same imaginative programmes. Lest we forget Europe’s leadership in social network analysis for security purposes, this is what Europol say in their 2012 annual review:
“Europol has adopted state-of-the-art social network analysis (SNA) as an innovative way to conduct intelligence analysis and support major investigations on organised crime and terrorism. Intelligence analysts are now able to deploy mathematical algorithms to map and measure complex and/or large data sets and quickly identify key players, groups of target suspects and other hidden patterns that would otherwise remain unnoticed. SNA is a valuable approach that complements conventional link analysis techniques, enhances the quality of intelligence reporting and helps to prioritise investigative work”.
Taken together, the public knowledge of SWIFT and PNR and the publicly made claims about new imaginative forms of intelligence provide evidence of the use of algorithmic techniques to mine and analyse social media data.
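Europol does not say which algorithms it uses, so the following is only a sketch of one of the simplest social network analysis measures, degree centrality, which ranks each node by the share of other nodes it is directly connected to. The contact pairs and the function name `degree_centrality` are invented for illustration.

```python
from collections import defaultdict

# Invented contact pairs; each tuple is an undirected link between two nodes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

def degree_centrality(edge_list):
    """Return {node: fraction of the other nodes it is directly linked to}."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        # With fewer than two nodes there is nobody else to be connected to.
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

scores = degree_centrality(edges)
# Highest score first; ties are broken alphabetically.
ranked = sorted(scores, key=lambda node: (-scores[node], node))
print(ranked[0], scores[ranked[0]])
```

Here node A is the ‘key player’ simply because it touches three of the four other nodes. Nothing about what A said or did enters the calculation.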

3. Is it ‘data’ that is collected?
Yes. The talk of ‘metadata’ imagines that the spoken content of a telephone conversation is private data, whilst the call number, location, time of call, etc. are ‘meta’. In fact, the point of what I have called in my work ‘data derivatives’ is that the associative links between elements are the valuable data points in themselves in social network analysis. http://tcs.sagepub.com/content/28/6/24.abstract
In a sense, the ‘content’ of the “dots to be connected” matters less to this kind of analysis than the relations between them. This is considered a security virtue because it makes it more difficult to assume that content is ‘innocent’ – in effect, apparently innocent content can become suspicious because it is subsequently associated with other things. Why does this matter? Because all forms of data become a resource to security – even normal and everyday data trails are required by the software if it is to learn what is ‘normal’, say, on the London Underground at this time on this day and in these circumstances. Predictive analytics are data hungry!
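Why must the software see so much ordinary data? A toy illustration, assuming a deliberately crude method (a z-score against past observations) rather than anything a real system is known to use: a deviation can only be measured against a baseline, and the baseline is built entirely from unremarkable, everyday records. The counts and the function `deviation_score` are invented.

```python
from statistics import mean, stdev

# Invented counts of, say, people passing one ticket barrier in the same hour on past days.
history = [100, 104, 98, 102, 96, 101, 99]

def deviation_score(past, observed):
    """Return how many sample standard deviations `observed` lies from the mean of `past`."""
    mu = mean(past)
    sigma = stdev(past)  # needs at least two past observations
    if sigma == 0:
        return 0.0  # no variation in the past data, so no scale to measure against
    return (observed - mu) / sigma

ordinary = deviation_score(history, 101)
unusual = deviation_score(history, 130)
print(round(ordinary, 2), round(unusual, 2))
```

The second observation stands out only because seven ordinary days were recorded first. With one day of history the score cannot be computed at all, which is the sense in which this kind of analysis is data hungry.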

4. Should we be worried that GCHQ generated 197 intelligence reports from PRISM last year?
Well, we should be more interested in how the 197 are arrived at. Security agencies cannot investigate every lead. The 197 represent persons or objects of interest that have come to attention after a process of running large volumes of data through the analytics software. It is the software that helps decide what the elements of interest should be – this travel associated with this financial transaction, associated with this presence in an open source online community, associated with this presence on Facebook. Put simply, the vast bulk of the filtering and analysis happens before any named individuals or lists are identified. Again, there is publicly available discussion of this kind of process, much of it in the proceedings of computer science conferences http://www.nap.edu/catalog.php?record_id=10940
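The narrowing described above can be pictured as a funnel. The sketch below is purely hypothetical: the identifiers, the three data sources and the function `funnel` are invented, and a plain intersection of sets stands in for whatever rules real analytics software applies. It shows only how chained associations shrink a large pool to a handful before anyone is named.

```python
# Invented pseudonymous identifiers flagged in three separate data sources.
travel_flags = {"p01", "p02", "p03", "p04", "p05", "p06"}
finance_flags = {"p02", "p03", "p05", "p09"}
forum_flags = {"p03", "p05", "p07"}

def funnel(*stages):
    """Intersect the flagged sets stage by stage; return the survivors after each stage."""
    surviving = None
    trail = []
    for stage in stages:
        # The first stage seeds the pool; each later stage keeps only identifiers seen in both.
        surviving = set(stage) if surviving is None else surviving & set(stage)
        trail.append(set(surviving))
    return trail

trail = funnel(travel_flags, finance_flags, forum_flags)
print([len(s) for s in trail])
```

Each stage removes identifiers that lack the next association, so the figure that finally becomes a ‘report’ says little about how many people's data passed through the earlier stages.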

5. Do the security authorities “reach into” the servers of the internet service providers?
We can’t be sure, but the point is they don’t need to. In the case of SWIFT the data was provided in a blackboxed ‘mirror database’. The use of mirror databases means that, in practice, it would be possible for the commercial firms to report that they had not provided data – data would not be transmitted from the company, nor would it be ‘pulled’ by the US Treasury (in the SWIFT case). The mirror database would generate the data copies from any server, and then the mirror database would be accessed for analysis. The complexity of jurisdiction and applicable law should be compared to what we call ‘offshore’ when we think of finance. We are familiar with the idea that money held in certain spaces is not subject to the strict application of the regulations of a particular jurisdiction. This is increasingly also the case with data, particularly regarding the spaces of the ‘cloud’ and interoperable but dispersed databases. We should not forget that most of the data analysed is either open source or already available via commercial transactions (e.g. the PNR data contained in airline booking systems).

In short, we are now seeing a public debate that arguably should have started almost a decade ago. When talk of catching terrorists ahead of attacks by ‘connecting the dots’ begins again (as it does after every attack, most recently Boston and Woolwich), we must consider what these dots are and how they are connected.