Causality is a tricky thing. Kant argued for its existence as a pure concept of the understanding, a framework without which understanding the world becomes impossible. Obviously, if the world is the kind of place that humans can manipulate, understanding the causality between two events is absolutely imperative. But how do humans actually determine causality? Mostly, we do it by observation. If B always follows A (or even just usually follows A) and we cannot find any other supervening factor that explains both events, then most humans are content to say that A causes B.

Of course, this is not technically correct. To truly determine the relationship between A and B we need to set up an experiment in which we manipulate A and check that the corresponding event B occurs if and only if the prescribed manipulation of A occurs. This works extraordinarily well for situations in which we are actively generating data by experiment, but in observational contexts we are usually constrained to observing only correlations between events. This is not to say, however, that we cannot make statistical inferences about *likely* causality given merely observational data.

So, recall the situation with the Alfabank/Spectrum/Trump Health DNS (ASTDNS henceforth) request data. The data is obviously only observational, being historical data. So an outright experiment is beyond the realm of the possible. So what can we do to determine which of these systems is the prime mover? Can we use the data to draw some trustworthy conclusions? It turns out we can if we use a tool known as “Granger Causality”.

Granger causality was created by economist Clive Granger in 1969 and was originally advanced as an analytical tool for understanding econometric time series data. While it isn’t actually able to determine if activity observed in one set of data “causes” the activity in another set, it is able to identify if one set is useful for predicting the other. I won’t get into the exact method by which this is computed, but if you’re interested the Scholarpedia article is actually pretty good.

So there are a few scenarios with the ASTDNS data that are plausible. I’ve illustrated some of them in the figure below:

First, it’s possible that activity from the Trump server is driving the ASTDNS activity in both the Alfa and Spectrum DNS lookups. Unfortunately, this requires data that we don’t have, so we have to bracket this for the moment. Secondly, it’s possible that Spectrum activity is somehow driving the Alfabank activity (with the relationship to the Trump server unknown). The third scenario is that Alfabank is driving the activity at Spectrum Health. Fourthly, there could be cyclic causality: Spectrum and Alfabank recurrently driving each other, back and forth (like ping pong). This could be implemented through direct connections to each other, through an intermediary (perhaps the Trump server), or both.

Anyhow, I wrote up some Python code to compute Granger causality matrices for the 2×2 grid of possibilities in the figure above, just to see what falls out of the analysis. I binned the ping data into 24-hour segments and counted the number of DNS requests in each day-long bin for both servers. This collection of counts was used as the input to the causality analysis.
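For anyone who wants to try the binning step, here is a minimal sketch of how it could be done. It is not the code used for this post: the function name `daily_counts` is hypothetical, the timestamps are made up purely for illustration, and I'm assuming the lookups are available as Python `datetime` objects. The one detail that matters is that both series cover the same range of days, with zeros for days that saw no lookups.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_counts(timestamps, first_day, last_day):
    """Count events per calendar day from first_day to last_day (inclusive).

    Days with no events get an explicit 0 so that two series built over the
    same date range line up index by index.
    """
    per_day = Counter(ts.date() for ts in timestamps)
    n_days = (last_day - first_day).days + 1
    return [per_day.get(first_day + timedelta(days=i), 0) for i in range(n_days)]


# Made-up timestamps for illustration only (not the real lookup data).
alfa_lookups = [
    datetime(2000, 1, 1, 9, 15),
    datetime(2000, 1, 1, 17, 40),
    datetime(2000, 1, 3, 3, 5),
]
spectrum_lookups = [
    datetime(2000, 1, 2, 11, 0),
    datetime(2000, 1, 3, 12, 30),
]

# Use one shared date range so the two count series are aligned.
all_days = [ts.date() for ts in alfa_lookups + spectrum_lookups]
first_day, last_day = min(all_days), max(all_days)

alfa_series = daily_counts(alfa_lookups, first_day, last_day)
spectrum_series = daily_counts(spectrum_lookups, first_day, last_day)
```

Two aligned count series like these are the kind of input a Granger test expects. One common option, which may or may not match what I actually ran here, is `grangercausalitytests` in statsmodels, which takes a two-column array and a maximum lag.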

The output of the analysis looks like this:

Scenario 1: Alfabank Granger Causes Spectrum

Alfabank Granger Causes Spectrum

#############################################

lag = 1 p-value = 0.562684890646

lag = 2 p-value = 0.336075505277

lag = 3 p-value = 0.846784853068

lag = 4 p-value = 0.309622087085

lag = 5 p-value = 0.487775762199

lag = 6 p-value = 0.486892955779

lag = 7 p-value = 0.456408473073

lag = 8 p-value = 0.707146434731

lag = 9 p-value = 0.00977759072824

lag = 10 p-value = 0.0209185745729

Spectrum Granger Causes Alfabank

#############################################

lag = 1 p-value = 0.00582228852867

lag = 2 p-value = 0.000621484343992

lag = 3 p-value = 3.17964724591e-06

lag = 4 p-value = 1.32091426033e-05

lag = 5 p-value = 6.16174177547e-07

lag = 6 p-value = 2.87035433336e-06

lag = 7 p-value = 1.77655969007e-06

lag = 8 p-value = 6.71059468613e-06

lag = 9 p-value = 2.01604202067e-05

lag = 10 p-value = 2.90239060447e-05

The p-value, if you’re not familiar, tells you whether your test result is statistically significant. If it’s lower than some threshold (say 0.05), you can be reasonably confident that one series really does help predict the other. The lag variable is shorthand for the “complexity” of the interaction you are presuming. Investigating Granger causality with a lag of 1 asks whether the next value of B can be predicted from the value of A one time step in the past (over and above what B’s own history already tells us). A lag of two asks whether we can predict B given values of A that are one and two time steps into the past. And so on.
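To make the lag idea concrete, here is a small sketch of which values feed into the prediction at a given lag. It only lays out the data; it does not fit the regression or compute a p-value. The function name `lagged_rows` and the toy numbers are assumptions for illustration.

```python
def lagged_rows(a, b, lag):
    """For each time step t >= lag, pair the target b[t] with the previous
    `lag` values of b and of a (most recent first)."""
    rows = []
    for t in range(lag, len(b)):
        past_b = [b[t - k] for k in range(1, lag + 1)]
        past_a = [a[t - k] for k in range(1, lag + 1)]
        rows.append((b[t], past_b, past_a))
    return rows


# Toy series, not real data.
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]

rows_lag1 = lagged_rows(a, b, lag=1)
rows_lag2 = lagged_rows(a, b, lag=2)
```

With a lag of 2, the first usable row predicts `b[2]` from `b[1]`, `b[0]`, `a[1]` and `a[0]`. The Granger test then asks whether including the past values of A improves the prediction compared with using B's own past alone. Note that each extra lag costs one row of usable data.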

You want to choose the simplest model, all things being equal, so looking at the above results we find that Spectrum Granger Causes Alfabank is significant (less than 0.05 here) at a lag equal to 1. Alfabank Granger Causes Spectrum isn’t significant until we’ve got lags up to and including 9 time steps into the past. Clearly, the simpler assumption is that the Spectrum activity is somehow driving the Alfabank activity, scenario 2 in the figure.
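The "simplest model" rule above can be written down in a few lines. This sketch just re-reads the p-values printed earlier and finds the smallest lag that clears the 0.05 threshold in each direction; the helper name `first_significant_lag` is hypothetical.

```python
# p-values copied from the output above, keyed by lag.
alfa_to_spectrum = {
    1: 0.562684890646, 2: 0.336075505277, 3: 0.846784853068,
    4: 0.309622087085, 5: 0.487775762199, 6: 0.486892955779,
    7: 0.456408473073, 8: 0.707146434731, 9: 0.00977759072824,
    10: 0.0209185745729,
}
spectrum_to_alfa = {
    1: 0.00582228852867, 2: 0.000621484343992, 3: 3.17964724591e-06,
    4: 1.32091426033e-05, 5: 6.16174177547e-07, 6: 2.87035433336e-06,
    7: 1.77655969007e-06, 8: 6.71059468613e-06, 9: 2.01604202067e-05,
    10: 2.90239060447e-05,
}


def first_significant_lag(pvalues, alpha=0.05):
    """Return the smallest lag whose p-value is below alpha, or None."""
    return min((lag for lag, p in pvalues.items() if p < alpha), default=None)


alfa_first = first_significant_lag(alfa_to_spectrum)
spectrum_first = first_significant_lag(spectrum_to_alfa)
print("Alfabank -> Spectrum first significant lag:", alfa_first)
print("Spectrum -> Alfabank first significant lag:", spectrum_first)
```

This reproduces the reading given above: the Spectrum-to-Alfabank direction is significant from lag 1, while the Alfabank-to-Spectrum direction only becomes significant at lag 9.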

Frankly, this surprised me. I had presumed Alfa would be driving things, but it looks like we can discount that. What we still can’t discount, though, is that the Trump server is taking some action that triggers activity in the Spectrum server and then, some time later triggers a correlated action in the Alfabank server. That would be extraordinarily strange, but it’s certainly plausible.

I don’t know that we are any closer to determining what’s going on here, and honestly it’s probably above my pay grade. One theory, put forward by Louise Mensch, is that we are seeing indications of some kind of “data laundering” operation. But hopefully someone with more experience in the analysis and interpretation of DNS lookup data can use the analytics I’ve worked up in the past few posts to begin to provide some real insight.

This is an interesting way to look at the data. I’ve been analyzing the raw data too and have seen similar patterns.

I think there was a DNS cache at both sites, so it’s hard to know when the actual exchanges were happening. Connections to the Trump server might have been happening as fast as once per minute in some cases.

Not sure if there’s a good way to factor that into your analysis.

Drop me an e-mail if you want me to share some of my data.


Would be interested, yes. I’m more of a data jockey than anything else, so eager to talk to people who can work in this domain more fluently. I’m not seeing an email addy attached to your username…I can set up a dedicated email if need be.


Happy to share what I’ve found.

You can reach me at davidm [at] chipshield.com
