Why The Renewed Interest In The Linkscape Scams And Deception?

Yesterday a friend of mine, Sebastian, wrote a post titled, “How do Majestic and LinkScape get their raw data?“. Basically it is a renewed rant about SEOmoz and the deceptions surrounding the Linkscape product they launched back in October 2008, a little over 15 months ago. The controversy centers on the fact that moz basically lied about exactly how they were obtaining their data, probably motivated in part by wanting to look more technically capable than they actually are.

Now, I covered this back when the launch actually happened, in this Linkscape post, which drew quite a few comments, and there was more than a little heated conversation in the Sphinn thread as well. This prompted some people, both on Sebastian’s post and in the Sphinn thread on it, to ask: why all of the renewed interest?

It is not extreme, it’s just that it isn’t new. The fact that they bought the index (partially)? That was known from the beginning. The fact that they don’t provide a satisfying way of blocking their bots (or the fact that they didn’t want to reveal their bots’ user agent)? Check. The fact that they make hyped statements to push Linkscape? Check. {…} I don’t get the renewed excitement. – Branko, aka SEO Scientist

Well, I guess you could say that it’s my fault. Or, you could blame it on SEOmoz themselves, or their employees, depending on how you look at it. You see, the story goes like this…

Back when SEOmoz first launched Linkscape, it would have been damn near impossible for a shop their size to have performed the feats they were claiming, all on their own. Rand was making the claim “Yes – We spidered all 30 billion pages”. He also claimed to have done it within “several weeks”. Now, even if we stretch “several” to mean something it normally would not, say, 6 (since a 6-week update period is now what they claim for the tool), we are still talking about a huge amount of resources to accomplish that task. A conservative estimate for the average web page, counting only the HTML, is 25KB of text:

30,000,000,000 pages x (25 x 1024) bytes per page = 768,000,000,000,000 bytes of data (768 trillion bytes, which is 698.4TB)

(698.4TB / 45 days of crawling) x 30 days in a month = 465.6TB bandwidth per month
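For anyone who wants to sanity-check the arithmetic, here is a quick Python sketch of the same back-of-envelope math. The 25KB average page size and the 45-day crawl window are the post's own assumptions, not anything SEOmoz published:

```python
# Back-of-envelope estimate of the crawl size and bandwidth implied by the
# "30 billion pages in several weeks" claim. All inputs are the assumptions
# stated in the post above.

PAGES = 30_000_000_000       # pages Rand claimed were spidered
AVG_PAGE_BYTES = 25 * 1024   # conservative 25KB of HTML per page
CRAWL_DAYS = 45              # assumed crawl window (~6 weeks)

total_bytes = PAGES * AVG_PAGE_BYTES
total_tb = total_bytes / 1024**4             # bytes -> TB (binary, 1024^4)
monthly_tb = total_tb / CRAWL_DAYS * 30      # bandwidth over a 30-day month

print(f"Total crawl size:  {total_bytes:,} bytes ({total_tb:.1f} TB)")
print(f"Monthly bandwidth: {monthly_tb:.1f} TB")
```

The figures in the post are truncated rather than rounded, so the script prints 698.5 TB and 465.7 TB; either way, the point stands: that is an enormous amount of storage and bandwidth for a shop of that size.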

Now, I know that one of the reasons that Rand can get away with some of his claims is that most people just don’t grasp the sheer size
