In the comments of the previous blog post about ThePlanet, a reader asked about the distribution of infections. The reader wanted to know whether the infections not attributed to Skenzo and HostGator were spread evenly across the remaining organizations, with each one holding less than 10% of the total. The type of distribution the reader was describing, a few very large contributors followed by a long tail of small ones, is called a power law distribution. A power law distributed population will look something like this:

(Image courtesy of Wikipedia.org)
For this blog post I’m using data pulled in early May 2010 on AS21844 (ThePlanet), and I find that the infection counts are roughly power law distributed. I’ve gone over the methodology used to obtain this data in previous posts, but it bears mentioning that here the data is grouped by RWhois organization name. Later in this post I will look at the same data grouped only by IP address so that it can be compared with other AS blocks. The raw infection counts look like this when graphed:

The shape is much the same, and it is obvious that many organizations have only single- or double-digit infection counts attributed to them. The area between 500 and 2500 is entirely barren, with only a single entry beyond 2500. One of the issues when looking at data like this is the blur of data points below the 500 mark. One could simply strip away the outliers (those data points above 500), but in this particular case I don’t think that is an effective way to view the data. In statistics, people often “transform” the data to deal with this situation. This generally means applying the same mathematical function to every number, which preserves the ordering of the data points while making the graph easier to read. I generally favor the log/log method, which means I take the log of each number on both axes and graph it that way. Log (or logarithm) is a mathematical function best explained by Wikipedia, but it can be thought of as a number “reducer” that is applied uniformly across the data and shrinks large numbers far more than small ones.
To get a sense of the scale (using the natural log), the log of 2500 is about 7.8, the log of 500 is about 6.2, the log of 100 is about 4.6, and the log of 1 is 0. Once the data is transformed we can see there is a little variance in the actual distribution, but the fact that the points fall along a roughly straight, downward-sloping line is another good indicator of a power law distribution.
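For readers who want to try the transform themselves, here is a minimal Python sketch. It first reproduces the natural log values quoted above, then estimates the slope of the log/log line with an ordinary least-squares fit. It assumes the graph plots each organization's rank (1 for the largest) against its infection count. The function name `loglog_slope` and the `sample_counts` list are made up for illustration and are not the ThePlanet data.

```python
import math

# Natural logs of the reference points quoted in the text.
for n in (2500, 500, 100, 1):
    print(n, round(math.log(n), 1))

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Counts are ranked from largest (rank 1) to smallest.
    Every count must be greater than zero, since log(0) is undefined.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical counts, invented for this example (NOT the real data).
sample_counts = [2600, 480, 210, 95, 60, 33, 20, 12, 7, 5, 3, 2, 1, 1]
slope = loglog_slope(sample_counts)
print("log/log slope:", slope)
```

A clearly negative slope with points that stay close to the fitted line is consistent with a power law, though a straight-looking log/log plot on its own is suggestive rather than conclusive.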
