As I was "casually" surfing the web I stumbled upon some PHP files with random names. It turns out that these files are backdoors created by a hacking tool, which might warrant a post on its own later. The problem is that you can't really search for random data in a similar fashion as you can do with Google Dorks for specific files. So given a bunch of files, how would you find the ones with random names? In this post, I'll outline a statistical approach (#AI, #MachineLearning, #BigData, ...) I managed to use with some success to find multiple active and useable backdoors online. That being said, it's only a first step and far from perfect, any input on possible improvements is appreciated!
To shine some more light on what I mean, here is an example:
Which of these five files have a random name?
For anyone that knows English, this is pretty obvious, it's fkzptcdrrj.php.
While it's easy to tell in this case, how would you program a general solution for finding this file?
In the specific case of this malware, there are some things we know. The filename, excluding extension, is ten characters long and only lowercase letters. This is based on observations, I still haven't found the exact generating function for the names.
My first idea was to try some machine learning solution to detect English but I didn't like that solution. Firstly, it would fail for non-English names like "configuraracceso.php", "MotDePasse.php", etc. Secondly, it might also break for names with special characters like "edit_user.php", "wp-login.php", etc.
Since my wife is a librarian and a professional in data retrieval, I started by asking her. Understandably, she was a bit stumped at my request to find random names. However, before soon she recommended looking for names with "unlikely patterns".
Happily, I have created my own database with lots of file names that I use to uncover these unlikely patterns. The idea was to use this to create a Markov Chain with transition probabilities between characters in the names. Then finally, for each name, calculate the probability for the corresponding path in the chain. The names with the lowest probability are the "unlikely patterns".
To generate the Markov Chain I used approximately 16,000 unique file names of variable length converted to lowercase.
If we have the files "abc", "aaa", and "aac" we construct the following chain. We start with "abc" where we have one "a → b" transition and one "b → c" transition, then "aaa" with two "a → a" transition and finally "aac" with one "a → a" and one "a → c" transition. Now, for "a" there is a 1/5 chance the next character is "b", 3/5 it is "a" and a 1/5 that it is "c".
Starting with the Markov Chain, the transitions seem to make sense. "f → i" is very popular at about 18% (any file name with "file" in the name will add to this), while "f → v" is very unlikely at 0.02%. Below is the Markov Chain for the likely name "fileserver".
The top five most "likely" ten letter names results were:
And for the key results, the most "unlikely" names (slightly changed to preserve anonymity for victims):
I am very happy with these results! All top five results are from hacked servers. fkzptcdrrj.php was executable allowing for RCE. Actually, 1,2,4, and 5 are all from the same hacked server.
The seventh most unlikely pattern was an active web shell from the same malware authors that anyone could use on the infected server, as shown in the figure below.
While I think these results were surprisingly good I'm sure there is much room for improvement. Even continuing on the Markov Chain model there are many parameters. Should everything be converted to lowercase? Should only unique names be used? If the target length is 10, should only names of this length be used?
Looking at more than the top 5 results, there are quite a lot of false positives (non-random names) like "logoff_wtd", "notify_vtm", "bookflight". The first two are combinations of words and other data. "bookflight" is interesting as it is a combination of two normal English words, but the "k → f" transition is quite rare (1%). Some names arguably seem random, like "esp8266h2o", unless you know that "ESP8266" is a microchip. So perhaps including some word lists or knowledge databases could help eliminate false positives.
Almost eveyone I've told about this to have mentioned ENTROPY! "Just calculate the entropy", "find the name with highest entropy", etc. But this is really not straight forward. Sometimes a simplified version is used for passwords where entropy can be calculated as E = L * log2(R), where L is the length and R the size of the character set. This doesn't really help us in deciding if "admin" or "fkzpt" has the highest entropy, as the have the same length and possible same character set [a-z]. I believe the core problem is that we need to know the distribution of file names before entropy can be applied. Please let me know if you have a good idea on this track!
Markov chains are pretty good at finding probable patterns and consequently also non-patterns or randomness. Furthermore, multiple of the random files I found were indeed malicious with some even providing RCE on the infected servers.
Not if you consider letters separately, but you could split the word into n-grams (2- or 3- grams could be enough), and then compare by their entropy instead. "ad" should be a much more common 2-gram than "fk" in most popular languages.