Causality is a tricky thing. Kant argued for its existence as a pure concept of the understanding, a framework without which understanding the world becomes impossible. Obviously, if the world is the kind of place that humans can manipulate, understanding the causality between two events is absolutely imperative. But how do humans actually determine causality? We do it by observation mostly. If B always follows A (or even just usually follows A) and we cannot find any other supervening factor that explains both events, then most humans are content to say that A causes B.
Of course, this is not technically correct. To truly determine the relationship between A and B we need to set up an experiment whereby we manipulate A and check that the corresponding event B occurs if and only if the prescribed manipulation of A occurs. This works extraordinarily well for situations in which we are actively generating data by experiment, but in observational contexts we are usually constrained to observing only correlations between events. This is not to say, however, that we cannot make statistical inferences about *likely* causality given merely observational data.
So, recall the situation with the Alfabank/Spectrum/Trump Health DNS (ASTDNS henceforth) request data. The data is purely observational, being historical data, so an outright experiment is beyond the realm of the possible. What can we do, then, to determine which of these systems is the prime mover? Can we use the data to draw some trustworthy conclusions? It turns out we can if we use a tool known as “Granger Causality”.
Granger causality was created by economist Clive Granger in 1969 and was originally advanced as an analytical tool for understanding econometric time series data. While it isn’t actually able to determine if activity observed in one set of data “causes” the activity in another set, it is able to identify if one set is useful for predicting the other. I won’t get into the exact method by which this is computed, but if you’re interested the Scholarpedia article is actually pretty good.
So there are a few scenarios with the ASTDNS data that are plausible. I’ve illustrated some of them in the figure below:
First, it’s possible that activity from the Trump server is driving the ASTDNS activity in both the Alfa and Spectrum DNS lookups. Unfortunately, this requires data that we don’t have, so we have to bracket this for the moment. Second, it’s possible that Spectrum activity is somehow driving the Alfabank activity (with the relationship to the Trump server unknown). The third scenario is that Alfabank is driving the activity at Spectrum Health. Fourth, there could be cyclic causality: Spectrum and Alfabank recurrently driving each other, back and forth (like ping pong). This could be implemented through direct connections to each other, through an intermediary (perhaps the Trump server), or both.
Anyhow, I wrote up some Python code to compute Granger causality matrices for the 2×2 grid of possibilities in the figure above, just to see what falls out of the analysis. I binned the ping data into 24-hour segments and counted the number of DNS requests in each day-long bin for both servers. This collection of counts was used as the input to the causality analysis.
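The binning step can be sketched roughly as follows. This is not the code used for the analysis, just a minimal illustration; the function name `daily_counts`, the start date, and the sample timestamps are all made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_counts(timestamps, start, n_days):
    """Count events per 24-hour bin starting at `start`; empty days count as 0."""
    counts = Counter()
    for ts in timestamps:
        # Whole number of 24-hour periods elapsed since `start`.
        day_index = (ts - start) // timedelta(days=1)
        if 0 <= day_index < n_days:
            counts[day_index] += 1
    return [counts[i] for i in range(n_days)]


# Hypothetical timestamps for illustration only (not the real DNS log).
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
lookups = [
    start + timedelta(hours=1),
    start + timedelta(hours=23),
    start + timedelta(days=2, hours=5),
]

series = daily_counts(lookups, start, n_days=4)
print(series)  # [2, 0, 1, 0]
```

Running this once per server gives two equal-length series of daily counts, which is the shape of input a Granger test expects. Keeping the zero-count days matters, since dropping them would shift the time steps out of alignment.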
The output of the analysis looks like this:
Scenario 1: Alfabank Granger Causes Spectrum
Alfabank Granger Causes Spectrum
lag = 1 p-value = 0.562684890646
lag = 2 p-value = 0.336075505277
lag = 3 p-value = 0.846784853068
lag = 4 p-value = 0.309622087085
lag = 5 p-value = 0.487775762199
lag = 6 p-value = 0.486892955779
lag = 7 p-value = 0.456408473073
lag = 8 p-value = 0.707146434731
lag = 9 p-value = 0.00977759072824
lag = 10 p-value = 0.0209185745729
Spectrum Granger Causes Alfabank
lag = 1 p-value = 0.00582228852867
lag = 2 p-value = 0.000621484343992
lag = 3 p-value = 3.17964724591e-06
lag = 4 p-value = 1.32091426033e-05
lag = 5 p-value = 6.16174177547e-07
lag = 6 p-value = 2.87035433336e-06
lag = 7 p-value = 1.77655969007e-06
lag = 8 p-value = 6.71059468613e-06
lag = 9 p-value = 2.01604202067e-05
lag = 10 p-value = 2.90239060447e-05
The p-value, if you’re not familiar, tells you how surprising your result would be if there were no real relationship at all. If it’s lower than some comparison value (say 0.05), the test is conventionally called significant and you can be reasonably confident there is an interaction going on. The lag variable is shorthand for the “complexity” of the interaction you are presuming. Investigating Granger Causality with a lag of 1 asks whether the next value of B can be predicted from the value of A one time step in the past. A lag of two asks whether we can predict B given values of A that are one and two time steps into the past. And so on.
You want to choose the simplest model, all things being equal, so looking at the above results we find that Spectrum Granger Causes Alfabank is significant (less than 0.05 here) at a lag equal to 1. Alfabank Granger Causes Spectrum isn’t significant until we’ve got lags up to and including 9 time steps into the past. Clearly, the simpler assumption is that the Spectrum activity is somehow driving the Alfabank activity, scenario 2 in the figure.
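To make the lag idea concrete, here is a small sketch of how the past values of A line up against each value of B for a given lag. The helper `lagged_rows` and the numbers are invented for illustration. A full Granger test also includes B’s own past values and compares the model with and without A’s history; this only shows the A side of that setup.

```python
def lagged_rows(a, b, lag):
    """Pair each b[t] with the `lag` previous values of a (most recent first)."""
    rows = []
    for t in range(lag, len(b)):
        past_a = [a[t - k] for k in range(1, lag + 1)]
        rows.append((past_a, b[t]))
    return rows


# Made-up daily counts, purely for illustration.
a = [5, 3, 8, 1, 4]
b = [2, 6, 4, 9, 2]

print(lagged_rows(a, b, lag=1))
# [([5], 6), ([3], 4), ([8], 9), ([1], 2)]
print(lagged_rows(a, b, lag=2))
# [([3, 5], 4), ([8, 3], 9), ([1, 8], 2)]
```

Note that every extra lag costs one usable row and adds one more predictor, which is part of why the simpler, shorter-lag model is preferred when it already does the job.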
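The "simplest significant model" rule can be written down in a few lines. The p-values below are copied from the output above; the function name `smallest_significant_lag` and the 0.05 threshold constant are just for this sketch.

```python
ALPHA = 0.05  # significance threshold used in the text

# p-values for lags 1..10, copied from the output above.
alfa_to_spectrum = [
    0.562684890646, 0.336075505277, 0.846784853068, 0.309622087085,
    0.487775762199, 0.486892955779, 0.456408473073, 0.707146434731,
    0.00977759072824, 0.0209185745729,
]
spectrum_to_alfa = [
    0.00582228852867, 0.000621484343992, 3.17964724591e-06,
    1.32091426033e-05, 6.16174177547e-07, 2.87035433336e-06,
    1.77655969007e-06, 6.71059468613e-06, 2.01604202067e-05,
    2.90239060447e-05,
]


def smallest_significant_lag(p_values, alpha=ALPHA):
    """Return the first lag (1-based) whose p-value is below alpha, or None."""
    for lag, p in enumerate(p_values, start=1):
        if p < alpha:
            return lag
    return None


print(smallest_significant_lag(alfa_to_spectrum))  # 9
print(smallest_significant_lag(spectrum_to_alfa))  # 1
```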
Frankly, this surprised me. I had presumed Alfa would be driving things, but it looks like we can discount that. What we still can’t discount, though, is that the Trump server is taking some action that triggers activity in the Spectrum server and then, some time later, triggers a correlated action in the Alfabank server. That would be extraordinarily strange, but it’s certainly plausible.
I don’t know that we are any closer to determining what’s going on here, and honestly it’s probably above my pay grade. One theory, put forward by Louise Mensch, is that we are seeing indications of some kind of “data laundering” operation. But hopefully someone with more experience in the analysis and interpretation of DNS lookup data can use the analytics I’ve worked up in the past few posts to begin to provide some real insight.