What We Learned at ThePlanet (AS21844)

Posted by Oliver Day Tue, 06 Jul 2010 16:04:05 GMT

After months of looking into the infections on AS21844 (ThePlanet), we've decided to wrap up our investigation for now.  We have learned quite a bit from our communications with customers at ThePlanet.  While no one from ThePlanet has spoken with us officially, we have learned that they possess a direct feed of infected URLs from Google.  This means that large customers of ThePlanet, such as HostGator, should be able to learn of infections directly from their provider.  Partners such as Skenzo should also be able to use the same list to purge previously infected, and now abandoned, domains from their monetization framework.

For those of you who look at our Top 50 Infected Networks, you'll notice that ThePlanet is still at the top.  There really should be an asterisk up there, since some of those infections shouldn't be counted.  In particular, the Skenzo-related infections aren't actually a threat but are still listed due to a policy decision by the Safe Browsing team (which you can read about in a previous blog post).  The best solution for now is to get this list to Skenzo so they can remove the listed domains from their framework.  I am preparing this list for Skenzo right now, but eventually, I hope, ThePlanet will provide it for them.  To add some transparency to our research, I'll paste the top infected org names, with their infection counts, as reported by ThePlanet's RWhois server:

 

WebsiteWelcome 7909
Skenzo FZE 2838
Unidentified 683
Site5 LLC 430
Bahram Boutorabi 192
SiteGround.com Inc. 171
webserver-a-rackshack.directi.com 166
Mochanin Corp 136
maktoob.com 119
server sea 115
Payam Torkian 115
Our Internet_ Inc 112

 

Don't forget that some of the 7,909 infections listed as HostGator (WebsiteWelcome is the org name used by HostGator) are duplicates.  The data tends to include multiple pages (and/or directories) per website host, so these numbers require additional explanation.  If one were to count the infections by unique domain alone, the total would be noticeably lower.  Applying some command line fu to one of the data files shows us that the repetition is not nearly as high as it used to be.  Only four domains are repeated 10 or more times.

 

count domain
10 vadakarapally.org
12 attorney2traffic.org
16 e-sense.tv
17 niftysensex.com    
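
The de-duplication described above can be sketched in a few lines of Python. This is only an illustration, not the command line actually used: it assumes the data file boils down to a list of infected URLs, and the function names and sample URLs are made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls):
    """Count how many infected URLs share each hostname."""
    hosts = []
    for url in urls:
        # urlsplit only finds the host when a scheme is present, so add one if missing.
        if "://" not in url:
            url = "http://" + url
        host = urlsplit(url).hostname  # .hostname is already lower-cased
        if host:
            hosts.append(host)
    return Counter(hosts)


def repeated(counts, threshold=10):
    """Return (count, domain) pairs seen at least `threshold` times, smallest first."""
    return sorted((n, host) for host, n in counts.items() if n >= threshold)


# Hypothetical sample input; the real data files are not reproduced here.
sample_urls = [
    "http://example.org/a",
    "example.org/b/",
    "HTTP://Example.ORG/c",
    "http://example.com/",
]
counts = domain_counts(sample_urls)
unique_domains = len(counts)
repeats = repeated(counts, threshold=2)
```

The number of keys in `counts` is the unique-domain figure, and `repeated(counts)` with the default threshold gives a "count domain" listing like the one above.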


HostGator has roughly 7,563 unique infected domains according to our last count, and ThePlanet has 20,298 unique infected domains, with the true number likely around 17,000 (adjusting for Skenzo).  Where does that put ThePlanet in the context of our top 50 infected networks?  Exactly where they are now, actually.  The next closest network is GoDaddy's AS26496 with 11,576 infections.


Thoughts on WEIS 2010

Posted by Oliver Day Wed, 09 Jun 2010 14:58:42 GMT

Earlier this week I sat in on the Workshop on the Economics of Information Security.  One of the more lively research papers presented was on insecurities in the online pornography industry.  The paper 0 has also been written about by Threatpost 1.  As noted in Naraine’s article, the team crawled just over 35,000 websites using an automated system.  Interestingly, the team discovered that about 3.23% of those sites were also infected with drive-by downloads.  One aspect of the research I was curious about was the degree to which those infected porn sites were popular.  I spoke with Dr Wondracek after his talk about the possibility of figuring this out.  In my own thesis last semester I discovered that, of the sampled sites we receive from our data partners, less than 3% were listed as popular by Alexa.


To determine this, one simply downloads Alexa’s “Top 1,000,000 Websites” list 2 and formats the list of infected sites so the two can be compared (Alexa’s list uses canonical hostnames).  Then take the intersection of the two lists (find which hostnames appear on both list A and list B) and divide its size by the number of infected hostnames to get a percentage.  This statistic should answer Pr(Popularity|Infection), or the probability of popularity given an infection.
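
As a rough sketch of that calculation in Python: the code below assumes the Alexa file has been unzipped and that each line has the form "rank,hostname", and it treats stripping a leading "www." as an adequate stand-in for canonicalizing hostnames. The function names and sample hostnames are hypothetical.

```python
import csv
import io


def load_alexa_hosts(csv_text):
    """Parse 'rank,hostname' lines (assumed format) into a set of hostnames."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[1].strip().lower() for row in reader if len(row) >= 2}


def normalize(host):
    """Crude canonicalization (an assumption): lower-case and drop a leading 'www.'."""
    host = host.strip().lower()
    return host[4:] if host.startswith("www.") else host


def popularity_given_infection(infected_hosts, popular_hosts):
    """Estimate Pr(Popularity|Infection) as |infected AND popular| / |infected|."""
    infected = {normalize(h) for h in infected_hosts}
    if not infected:
        return 0.0
    return len(infected & popular_hosts) / len(infected)


# Hypothetical sample data standing in for the real lists.
alexa_text = "1,example.com\n2,example.org\n"
infected_sample = ["www.example.com", "bad.example.net", "EXAMPLE.ORG", "evil.test"]

popular = load_alexa_hosts(alexa_text)
share = popularity_given_infection(infected_sample, popular)
```

Note that the denominator is the set of infected hostnames, not the Alexa list; swapping the two would estimate Pr(Infection|Popularity) instead.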

[edit: moved links to bottom in footnote format for better readability]
0 http://weis2010.econinfosec.org/papers/session2/weis2010_wondracek.pdf
1 http://threatpost.com/en_us/blogs/understanding-porn-malware-connections-060810
2 http://s3.amazonaws.com/alexa-static/top-1m.csv.zip


A Detailed Look at ThePlanet's Infection Distribution

Posted by Oliver Day Tue, 25 May 2010 16:17:58 GMT

A reader asked a question in the comments of the previous blog post about ThePlanet regarding the distribution of infections.  The reader wanted to know whether the rest of the infections, those not attributed to Skenzo and HostGator, were evenly distributed, with each organization holding less than 10% of the total infections.  The type of distribution the reader was describing is called a power law distribution.  A power law distributed population will look something like this:
[Figure: power law distribution, courtesy of Wikipedia.org]

For this blog post I'm using data pulled in early May 2010 on AS21844 (ThePlanet), and I find that the infection counts are roughly power law distributed.  I've gone over the methodology used to obtain this data in previous posts, but it bears mentioning that I am using data grouped by RWhois organization name.  Later in this post I will look at the same data grouped only by IP address so that it can be compared with other AS blocks.  The raw infection counts look like this in graphical format:
[Figure: raw data plot]

The shape is roughly the same, and it is obvious that a lot of organizations have only single- or double-digit infections attributed to them.  The area between 500 and 2500 is entirely barren, with only a single entry beyond 2500.  One of the issues when looking at data like this is the blur of data points below the 500 marker.  One could simply strip away the outliers (those data points above 500), but in this particular case I don't think that is an effective way to view the data.  In statistics people often "transform" the data to deal with this situation.  This generally means they apply the same function to every number, which preserves the ordering of the data but makes it easier to read.  I generally favor the log/log method, which means I take the log of the values on each axis and graph them that way.  Log (or logarithm) is a mathematical function best explained by Wikipedia, but it can be thought of as a number "reducer" that is applied uniformly across the data.

To get a sense of the scale, the natural log of 2500 is 7.8, the log of 500 is 6.2, the log of 100 is 4.6 and the log of 1 is 0.  Once the data is transformed we can see there is a little variance in the actual distribution, but the line slopes steadily downward, and a roughly straight, downward-sloping line on a log/log plot is another very good indicator of a power law distribution.
[Figure: log/log data plot]
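
For readers who want to try the transform themselves, here is a minimal Python sketch. It reproduces the log values quoted above and fits a straight line to a log/log plot of count against rank. Plotting against rank is an assumption made for this example, and the counts are synthetic, generated from an exact power law rather than taken from the ThePlanet data.

```python
import math


def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank), largest count first.

    Counts must be positive, since log(0) is undefined.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Natural logs of the reference values mentioned in the text.
scale = {n: round(math.log(n), 1) for n in (2500, 500, 100, 1)}

# Synthetic counts following count = 1000 * rank^-1.5 (illustrative only).
synthetic_counts = [1000 / rank ** 1.5 for rank in range(1, 51)]
slope = log_log_slope(synthetic_counts)
```

For data that follows a power law exactly, the fitted slope equals the exponent (here -1.5). Real infection counts will scatter around the line, which is the "little variance" mentioned above.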
