How To Block The Bots SEOmoz *Isn’t* Telling You About

Posted on October 17th, 2008 at 2:05 pm by Michael VanDeMar under blogthropology, coding, lackofmeds, nerdiness, On The Ball-ness, psychoblogging, scams, SEO, Social Media

I swear to tell the... wait, what did you say..? Ok, so, looks like Rand and gang finally decided to reveal their top-secret recipe about how they gathered all that information on everybody’s websites without anyone noticing what they were doing. There was quite a bit of hoopla over the fact that when they announced their new index of 30 billion web pages (and the new tool powered by that index), due to the fact that they never gave webmasters the chance to block them from gathering this data. In fact, they never even announced their presence at all.

While this is a huge breach of netiquette as it pertains to crawlers, at least today Rand finally announced that they are now disclosing their sources for data. In fact, this was how he worded it to the community:

we are now disclosing our sources for data – Rand Fiskin, SEOmoz CEO and really, really open guy

Better late than never, right?

The thing is, as I was looking over the list of bots that you would need to block in order to prevent mozzers from gathering your data, I noticed this subtle, easy to miss pattern in what they were listing. You have to look really, really close to see it, and the untrained eye might never see it at all, but luckily, eventually, I did see it for myself:

other data sources and additional crawls...?

That’s right folks, if you do decide to keep moz out by blocking all of the big guys (Google, Yahoo, MSN, Ask, Amazon, and Alexa), the lesser known guys (Dotnetdotcom, Grub, Page-Store, and Exalead), and that one fictional guy they threw in there (Gagablast), you still won’t have them blocked. Fear not though… after much work, I finally figured out that Rand was indeed true to his word, and that they did in fact release enough information to block the bots. All you have to do is add the following lines to your robots.txt, and you’ll be golden*:

User-Agent: *and other data sources
Disallow: /
User-Agent: Additional crawls*
Disallow: /

See? I got ya covered! :D

Seriously though, despite the fact that Rand and Co. are still less than forthcoming about all of the bots that are used (I’m guessing he doesn’t actually know, to be honest), something much more revealing is highlighted in this information, namely, they lied about having their own crawler.

Let’s take a quick review of some statements Rand made in the initial announcement about the tool:

  • Our crawl biases towards having pages and data…
  • As others who’ve invested energy into crawling the web…
  • our crawl biases towards this “center”…
  • Our process for crawling the web…
  • Moving forward, we’ll… invest in better and faster crawling…
  • In comparing our crawls against the engines…
  • we’ll be releasing more information about our crawl…

You also have statements made by moz employee Nick Gerner like this:

  • we’re crawling everything we can…

You even have him claiming bullshit like this:

We do prioritize the crawl according to pages we think are important. For now, and probably for the foreseeable future we’re going to rely on link endorsement to make that decision. Make good content, get good links. Keep it publicly available. We’ll get there soon enough – Nick Gerner, moz employee

That’s right, not only did they claim that they were crawling the web, they wanted us to believe that they prioritized how they crawled based on an importance algorithm! :D

I have to admit, Rand, it’s pretty bold to basically admit this late in the game that you guys lied through your teeth and grossly misrepresented the facts, just so you could appear to have accomplished a much bigger task than you actually did, all in the name of getting more money from webmasters. That’s a much bigger admission than saying you cloaked your bot, if you ask me. Gratz on coming clean.

*Disclaimer: The code I listed is sarcasm, btw. Those robots.txt lines won’t actually block anything, just in case you didn’t know.
Enjoyed what you read here? Subscribe to my feed.

  You should follow me on Twitter!

Be Sociable, Share!

78 Responses to “How To Block The Bots SEOmoz *Isn’t* Telling You About”

  1. Jay Says:

    I think the time has come to block all bots by default and to only allow choice bots to craw your site.

    user-agent: *
    disallow: /

    user-agent: goodbot1
    disallow:

    user-agent: goodbot2
    disallow:

  2. Michael VanDeMar Says:

    Jay, actually I don’t mind the crawling part. Unless a bot is actually putting a strain on my server, or from a scraper, it doesn’t bug me. The thing that got me with this whole deal is that I knew from the beginning that they didn’t suck down those pages themselves, and were just trying to look like they could do more than they actually can. It was feigned added value.

    I mean, think about it… even if they had a server farm that was sucking down 50 pages a second, 24/7, it would take them 600,000,000 seconds to hit the 30 billion pages that they are boasting. That’s 6,944 days, or 19 years to get the initial index built. They are claiming to have done it in 1 year, and that they will have enough new data every month to make for meaningful updates. There is no way they were gathering that data themselves.

  3. Andy Beard Says:

    They support meta noindex but notice they don’t mention support for meta nofollow

  4. randfish Says:

    Sorry about that – obviously, it was supposed to be Gigablast (lousy typos). Gagablast sounds like some sort of infant vomit :)

    And Andy – we treat meta nofollow the same way the engines do – they don’t appear in our link graph or any of the calculations. I’ll ask the guys to add that to the information page.

  5. Neyne Says:

    Now that is a response that is to the point. This changes the whole perception now that we know Rand meant Gigablast and not Gagablast. Phew. Already feeling better.

  6. Black Lynx Fillmore Says:

    hahaha!

    Mad weather.

    I will certainly be blocking all bots. Let’s build another internet, outside of Google and the CIA

  7. smallfish Says:

    So who’s behind dotnetdotcom.org? “few Seattle based guys” “Trust us” ? WTF!? Why are there absolutely no names on that site?

  8. Michael VanDeMar Says:

    @smallfish – they’re no one special, I’m sure. There are tons of small companies and organizations out there that grab pages to analyze the data. The index they make available for download only accounts for .01% of the index that moz is boasting, from what I can tell.

    Interesting to note that their claim of having spidered 7,011,296,117 pages since June 10th of this year would have required them to be able to spider approx 614 pages per second, and their counter is only advancing at 1 page per second.

    Of course, even more interesting to note that apparently exactly 7 billion of those pages were supposedly spidered some time in the past 3 days. :P Something’s not right with them.

  9. Doug Heil Says:

    Very troubling. You make excellent points Michael and in a very clear way. Something never seemed right to me from the very moment this new link thing was launched. Then, when Rand “finally” announced his sources about 1 1/2 weeks later, it all became extremely clear.

    Very troubling. I don’t believe something like this can be rectified. Character and judgment are separates many in this industry from others. It was oh so very clear at the start that moz wanted the industry to think that had an actual “crawler”. This is a huge difference between a “crawl” and a “crawler”. The mozzer backers can spin this all they wish, but the actual facts are hard to ignore.

    And yes; Michael is very correct; there is absolutely no way to block the linkscape stuff from your own site. Doing so just might be having you also block stuff you might not want blocked from the major search engines who actually give you something in return for the actual crawling of your websites.

  10. randfish Says:

    BTW – I tried to address many more of Michael’s concerns and explain more about the crawl and yes, crawlers that built the data in Linkscape’s index on the Sphinn thread – http://sphinn.com/story/79700.

    @doug – that’s not accurate. If you read our sources pages, you can see exactly how to block Linkscape from listing your sites/pages without blocking any of the engines. We obey the robots meta noindex tag, but also an seomoz noindex if you just want to target our index.

  11. Doug Heil Says:

    I did read it Rand. I read it when you first announced the data last week. It wasn’t clear to me at all. Is it now? If so, and we all noindex with a meta tag, how do we know it was followed? Considering the smoke screen perpetrated at the launch, we are simply suppose to trust things as they are? Further; how about “other sources” you are not disclosing? What if I want those sources crawling pages but not having to do with linkscape? Pages you already have in your tool from out there are going to be in there for how long? If the site puts up this meta tag noindex thing, how long will those pages be in… the ones in before you even launched?

    There is an implied industry specific code of conduct in regards to bots and tools and anything else having to do with our websites. Do you think you fulfilled this code of conduct with your tool?

    Further; why didn’t you simply tell all of us at launch exactly what was what?

    Further still;
    http://www.seomoz.org/ugc/offi.....ack-thread

    At the very start of that thread, “your” guy named Nick Gerner who has LOTS to do with the development of your tool also made it certainly seem like you had your own bot “crawling” the web. Please read all his comments and just try to come to some other conclusion. I defy anyone to know at launch that you in fact didn’t have your own crawler/spider.

    So now; let’s say we all do have a definitive way to block your stuff; That way is to stick in a site wide noindex tag just for you. In other words, opt out of your stuff. What do we get for all that trouble? We do this anyway for bad bots and for pages we don’t want indexed in the major search engines, etc, so what do we get from you for all of this? We either subscribe to your tool and have a tic for tac as competitors have our data so we can have their data, or we do nothing at all and get nothing at all for doing nothing. In fact; we get the priviledge of knowing competitors may have our data, including you. Tell me the exact benefits we get for simply doing nothing?

    Please note I am not speaking for my site at all as I don’t give a flying “F” about what others may know, but I am speaking for all other site owners who don’t know much and don’t know about this and never will. What benefit is it to them? Do they get free referrals from you or something, like you do with google? Site owners who know nothing even get referrals from the se’s for doing nothing but allow their pages to be crawled.

  12. Doug Heil Says:

    Oh, and this from Danny Sullivan;

    “I was surprised that blocking information wasn’t up at the time the tool was released. Even stealthy Cuil/Cuill put up blocking information *before* they released their product.

    Bottom line, if you’re going to crawl the web and to be a good web citizen, then you obey robots.txt blocking. If you don’t, then you’re not a good citizen in my books.

    Rand says blocking will be added, so great if that happens in short order. I know what it’s like coming off of a trip. What’s not clear is how quickly the existing data will be dropped. If they don’t recrawl that often, then site owners then site owners who don’t want to be in the database have to wait for a revisit, then wait for removal.

    Possibly they could add an instant remove tool, but that’s a lot of work (you have to verify who owns a site, etc.)”

    That was nine days ago, just after you launched. It’s VERY clear he was under the impression you had your very own crawler/bot considering what he wrote about things at the very start. You know he doesn’t consider me a friend, but he certainly does you. If even he was duped by all of this, it’s not a stretch why or how well all were.

  13. Skitzzo Says:

    Rand, when you so frequently admit making fairly significant mistakes, at what point do you expect to lose the goodwill of the community and no be trusted?

    As I said earlier, I would highly suggest you analyze the way your company works. I understand you’re busy but with the millions you got from the VC would it be that hard to have someone dedicated to putting out fires like this?

    The fact that it took you a week to release user agent info is just pathetic. The fact that it took you more than a day to remove an offensive post that was published on your blog is unacceptable. I realize you and the mozzers travel but hire a PR person for crying out loud!

  14. randfish Says:

    You’re accurate in the assessment that we didn’t reveal sources when we launched, and now we have. You can choose to block them, or just have us exclude your pages from showing in our lists of data. If you want to confirm, just use the tool – it will take until we re-crawl those pages that have the seomoz meta tag for us to exclude them (and we update once per month, so hopefully 30-60 days).

    And since we DO have spiders that fetch the data, and those spider names are now public, I don’t see the disconnect. I think this is playing at semantics and interpretation – our bots aren’t called “linkscape” or “seomoz” but we do have bots gathering data for our index – there’s no other way to build an index.

    I agree that when we launched, we should have disclosed the UA sources at that time – we thought during development it would be unwise, but have come around. I apologize for that delay.

  15. Michael VanDeMar Says:

    Ok, now I’m confused again as to what you are claiming.

    And since we DO have spiders that fetch the data, and those spider names are now public…

    Just to be clear, are you claiming that you have bots, but they are disguised as Googlebot, msn, etc? Or are you claiming that you own the bots belonging to those search engines? Or did you mean something altogether different when you listed those as your “sources” for data?

  16. Skitzzo Says:

    Also, wasn’t one of the biggest sales pitches for Linkscape the fact that other people’s data was inaccurate? And yet you’ve now revealed that you rely completely on that unreliable data of other bots?

    I think the issue here, Rand, is that you promoted so heavily your ability to crawl the web and gather your own data but now are saying that the value lies in your ability to compile others’ data into useful reports.

    Those are two drastically different things.

  17. Doug Heil Says:

    Wow; I’m with Michael.

    Rand; your last post makes no sense. I was clear before about how and why you lied at launch, but now I’m not clear about anything you are doing.

    I suggest you shut the heck up now.

  18. randfish Says:

    The source names are all listed – those we use now and those we may use in the future. The product was promoted as a link graph of 30 billion pages that we collected, and that’s exactly what it is. The collection process was done through spidering the WWW with the UAs and sources you see listed on our page.

    It feels like I keep saying the same thing over and over again. Am I not explaining it clearly? Maybe I’ll try one more time:

    #1 – Linkscape is a) an index of the WWW that currently contains 30 billion URLs. The crawl to retrieve this data was designed by SEOmoz, based on the metrics we created like mozRank and using the idea that we had a preference for domain diversity (reaching as many domains as possible) over reaching deeply into subpages and subsections on individual sites. This index is consistently updated – approximately once per month, with the first estimated update around Halloween. Future updates will focus on keeping the data fresh (i.e. re-crawling stuff we’ve already seen) and expanding into new territory that we haven’t seen (but have seen links to).

    #2 – Linkscape is also a tool that lets you see this link graph data for any page or site in our index. This includes links with lots of features (nofollow, image links, links that are hidden, etc.) and sorting (by any of the metrics or by searching for particular keywords in the title, url, or anchor text) and metrics (like mozRank and mozTrust).

    #3 – Linkscape’s data sources are listed on this page – http://www.seomoz.org/linkscape/help/sources – if you’d like to block individual bots, you can do that and if you’d prefer that specific pages didn’t appear in our list of links you can use the seomoz or robots meta noindex tag.

    Doug – there’s a big difference between lying at launch (which would be to claim that which is false) and not revealing, which is what we did. Now we’re doing neither – we make the claim, which is true, that Linkscape has crawlers, which we disclose, and we make the claim that we designed, built and continue to refresh the index, which is also true. The “lie” of omission is a fair one, but these others are not – anyone using the tool can easily see and test that it’s factual.

  19. Michael VanDeMar Says:

    Ok, so, again, just for clarity, when you say this:

    Linkscape’s data sources are listed on this page – http://www.seomoz.org/linkscape/help/sources – if you’d like to block individual bots, you can do that

    You are indeed talking about blocking Googlebot, Yahoo! Slurp, etc… you utilize data that is contained in your index that was initially collected by those bots, correct?

    I mean, let’s be very, very clear here. You do not list a single user agent on that page you linked to. What you list is a variety of sources that you obtained the data through, and each of those sources has user agents associated with them. Those bots can be detected via their respective user agents, and those bots can be blocked via robots.txt, since those bots all adhere to the Robots Exclusion Protocol. Would you say that all of that was accurate so far?

  20. Doug Heil Says:

    There simply isn’t a clear way to block the linkscape stuff. I said as much a few days ago in the feedback thread. Moz won’t give us a clear way either as doing so would not be in the best interest of his investors. This is a money kind of thing for sure. The hell with morals or any kind of code of conduct because mozville is a SEO.

    BTW: Michael; I see that “spin” place already shut down comments on your blog post. After one day, and over the weekend, and way before the start of the work week, sphinn.com shuts down comments. Ironic? No surprise to me.

  21. Doug Heil Says:

    MVD: I have a suggestion for you. Since comments are cut off at sphinn, many just might want to comment in here instead, but the fact your captcha is quite the pain, I suggest you change it. Sometimes it takes me ten or more tries to get it right. I’ve left in the past from you when trying to comment. I’m very sure it’s chasing away many posts.

  22. Michael VanDeMar Says:

    @Doug, to be honest I don’t think most people have an an issue with it. If it helps, it’s always a single 5-6 letter word (not just letters), no spaces, no junk in the middle. If it’s not obviously a word, flip the left and right sides.

    @Rand – still waiting for that clarification.

  23. randfish Says:

    Sorry – took a few hours off. The comment you made sounds nearly right to me, though I wouldn’t concentrate on just the two sources you mentioned. The list you’ve got in the post is currently comprehensive and we’ll update it anytime we use data from a new source. We specifically say that we may now or in the future use those sources, and that’s as detailed as we’re getting publicly for now.

    Doug – you are correct regarding investors and finances. This tool is not a noble effort to bring peace and freedom to the land. It’s made by a for-profit company, and the only way it will be successful is if the value it provides to members is consistently high. We designed it as a commercial product and the initial launch decisions not to reveal sources and the current messaging are intended to help protect the company’s interests and competitive abilities.

    BTW – The comments are kind of challenging with this captcha, but so far, I’ve been able to figure it out OK. I suspect non-native English speakers might have a more challenging time, though.

  24. Mert Says:

    Rand,

    Just to reiterate
    I quote your comment -“And since we DO have spiders that fetch the data”.- If you do have spiders, why not let the block be through a server robots.txt block rather than bloating the server. Can you imagine the code bloat if 30 other seo companies follow you with their own tools. None of your explanations above made sense to me. Your comments remind of me of the circling settler wagons fighting the Indians, trying to gain time till the imaginary cavalry comes. Mike, after your dynamic url blog on Google and now going after this issue I nominate you to the blog of the year.

  25. randfish Says:

    Mert – we are letting folks block through robots.txt. We list all the data sources and all have bots/UAs that can be blocked if you so choose. How did none of my explanations make sense? I don’t know how to be more clear about it… Maybe I need a whiteboard or something.

  26. randfish Says:

    @Skittzo – somehow missed your comment in the shuffle, so I’ll try to address that, too.

    Yes – a big part of the selling point is that other data is both misleading (as Yahoo’s inlink counts also include nofollows) and incomplete – no one else shows 301s/302s/meta refreshes. We also felt that data like Google’s PageRank wasn’t accurate in some cases (where they’ve penalized in the toolbar) and certainly wasn’t specific enough. In a logarithmic scale, which toolbar PR uses, a 5 might be anywhere from five to ten times more important than a 4, and not knowing whether your site is a 4.1 or a 4.9 is pretty frustrating – mozRank attempts to address that by showing precision to two decimal places.

    Other inaccuracies we’re trying to address include the time between PR updates – 3 to 9 months, with no visibility on when the next update might be. We update every month and the data pushes out at that time.

    There’s also no visibility into anchor text or image links or many of the other flags that we assign to links in our graph from any of the other tools and Linkscape’s data is meant to be accurate in that way as well.

    I did not mean to suggest that Google or Yahoo! had an inaccurate crawl of the web – I don’t think anyone came away with that impression. I meant that link data is frequently poor, old, inaccurate and not detailed enough and Linkscape addresses those issues through all the points I mentioned above and many more.

  27. Gab Goldenberg Says:

    Mike, did my earlier comment get held up in moderation?

  28. Michael VanDeMar Says:

    Rand, for starters, telling people that if they want to block your tool via robots.txt then all they have to do block every major search engine plus a few lesser known ones is:

    a) ludicrous, and

    b) a lie.

    Even if it was a reasonable suggestion (and let’s again be very clear, it is not), you put right at the bottom of the list, “Additional crawls from open source, commercial, and academic projects.” So, according to what you listed, unless someone were to block everything via robots.txt, one or more of your “additional” sources might still gather the data, and it still wind up in your index.

    Additionally, this brings us back to where you claimed dominion and control over the spiders that collected the information. You stated that you had an algorithm that determined which sites and pages they spidered next. You based these claims on the basis that Linkscape had its own crawler, and that simply is not true.

    @Gab – I don’t see any other comments from you. You’re on the approved list, they would go straight through.

  29. Doug Heil Says:

    I’m not getting this at all Rand. If people in this thread alone are not getting it, what makes you think the industry gets it either? In my opinion the only people who might get it are the people who just don’t care….IE: mozzers in your little circle. They don’t care what you do or how you do it, so they don’t want to understand the process either. You could screw them very directly and they wouldn’t know it. LOL

    Many others do care about a firm who attends most every search conference, and speaks at all of them, and promotes themselves to the ninth degree, and is buddies with the leader.. Danny Sullivan. We care about how that firm is doing business. In YOUR own words: “are intended to help protect the company’s interests and competitive abilities.”

    So do WE want to protect our competitive abilities as well. What works for you also works for the entire industry. I don’t like the FACT that unless I PAY/SUBSCRIBE to YOUR tool, I don’t have the data that my client’s competitor does. I don’t like the FACT that I cannot block your shite using robots.txt unless I also block google and yahoo and MSN as well. Come on now. LOL

    I don’t like the FACT that I get a little meta tag from you to stick on all my pages, and MAYBE they won’t be in your index… just maybe. The thing is; the page might not be included, but what about the links? Blocking VIA robots.txt if Farrrrrr different than blocking using a meta tag.

    I also think it’s funny stuff that mozzers everywhere think they are getting some great data all about links. We all know however that your tool does not tell you what Google actually thinks about that link. Not sure how relevant your data could be. Michael Martinez had a great post about this exact thing. I agree with him totally.

    That being said; the way you launched this tool was very deceitful. You seem to find a way to piss me off in the end, and after I had grown to actually think your firm just might be “best practices” after all.

    There is such a thing as an implied code of conduct within the SEO industry. There is such a thing as reasonable expectations of good business practices among like minded firm’s, and firm’s who want to do good for the industry. In my opinion, you have failed at this on all counts. I’m greatly disappointed with most everything involved. You might not think little o’l dougie’s opinion means a damn thing….. probably not, but we shall see.

  30. Marty Martin Says:

    @DougHeil Linkscape to me is a huge boon in the competitive intelligence landscape as well as a way for me to objectively evaluate my own website content over time and in comparison to my competitors. What @randfish and team have accomplished is putting together a lot of disparate and fractured content into one convenient place and applied an (arguably) objective evaluation of it.

    Competitive intelligence is a factor in every business sector and the Internet and, more specifically, SEO is certainly no exception.

    So what if @seomoz didn’t provide this date. You can get similar data from other places, but now I don’t have to pay as much for it, which for me is a huge advantage. Yes, my competitors can also learn about me, but they could in the past too so it’s really not that much of a change.

    I know this comment is probably going to get flamed for “sticking up for the mozzers” but I’m really trying to be fair and objective. I know and understand (at least I think so) why you don’t like it, but as I said, this data is already out there. The only way to hide it is to take your website offline or go members-only and not allow any indexing which is impractical.

    @randfish and team have taken data that normally would be out of the price range of small to medium SEOs and placed it within our control, which I for one, am glad to have.

  31. Michael VanDeMar Says:

    @Marty – I just want to make it clear that nothing I am talking about in this post has to do with what I think that the the tool’s actual value is. I only used the free version, and I personally wasn’t that impressed, but that’s not what this post is about. I build tools that do rely on freely obtainable data, but that put all of that data in one place, and automate querying it, and there is value in that. Even if I think that moz is way overpriced, I am not saying that is worthless, and honestly, that’s not the issue at hand.

    The issue is the selling tactics of presenting the tool as if it were built by a company capable of doing much, much more than what they actually are. Anyone can build a gocart in their garage. If someone builds gocarts, and sells them saying they were built using Porche motors, or tell you that they’ve been building gocarts for thirty years, then it better damn well be true. If you bought one under those premises, and then down the road you discover that they used Honda motors and only started building them 6 months ago, you’d probably be pissed, regardless of how well it ran.

    Rand billed his tool as being something that he and his company built by having spidered 30 billion pages themselves. This was a lie, plain and simple.

  32. Doug Heil Says:

    Marty; Fair point by you. That being said; considering that I know SEO as well and sell MY services in regards to that, why the heck would I use another SEO’s opinion about what value some link has? That’s nutty. You seem to be a SEO as well, so I have to wonder why you value judgment from another SEO over your own tools and data and judgment?

    Anyway; I agree with the above as I really don’t care about his tool or the value or lack of value at all. That’s not my concern as I would NEVER personally use some silly tool out there to tell me what is what. Don’t have to. No need to. I do care about why and how a firm in our industry goes about their business under the guise of “best practices”. What the moz firm has done and how they went about the launch and what they have said about the tool from the very beginning, is about as far away from “best practices” as I know of. Period.

    That’s the issue. That’s the point.

  33. Michael VanDeMar Says:

    Also, Rand, it’s blatantly obvious that even the industry leaders assumed that you guys had done all of the spidering yourselves. For example, when Rustybrick wrote this:

    SEOmoz launched a new tool called Linkscape, the tool utilizes a growing index of 30 billion pages indexed by the SEOmoz team to plot detailed linkage data for SEOs. In the past, SEOs relied on search engine data from APIs or via a scraper. SEOmoz decided to crawl the web themselves to get this data.

    If your intent truly was not to deceive, then why did you make no attempt to correct this misconception? In fact, you did quite the opposite. According to Rustybrick:

    Update: I spoke with Rand and he said they will announce a single useragent that people can use to block the SEOmoz bots from crawling their site.

    That is something that you could only do if you did in fact control all of the data sources from the crawling end of things. You had to have known that. This isn’t even close to “semantics”.

  34. randfish Says:

    So – I think the reason you’re upset is because we didn’t release a single UA, but instead a list. In that list might be sources we built and designed to crawl the web and others we use to bolster our data if/when those sources can’t reach data that we know we need to build a comprehensive index.

    It was not a lie to say that we built a spidering process or an algorithm to determine crawl rate or designed the crawl itself and the de-duping parameters, the way link juice flows, or any of the other claims. None of them were untrue. The problem I see here is that we’re being intentionally deceptive in order to protect the quality of the product and discourage competitors from being able to copy what we do.

    However, I stand by the fact that a lie of intentional omission is not the same as a direct untruth. Your accusations are lies in that they are not true. My claims are true, but don’t reveal everything.

    On the last point regarding Barry’s post on SERoundtable – I think that was a miscommunication. I said that we’d be revealing ways to block us and ways to avoid inclusion in our results, but Barry took that to mean a single UA. I do apologize for that – we chatted only briefly in person and at that time, I hadn’t even met with the team to figure out a protocol, and knew only that we were going to be providing a way to be excluded from the public index.

  35. Michael VanDeMar Says:
    So – I think the reason you’re upset is because we didn’t release a single UA, but instead a list.

    Um, no. And again, you have to know that’s not what I am saying, and this has to be deliberate obfuscation. That’s actually kinda rude, Rand.

    Let’s try for a couple of simple direct, yes or no, don’t require expounding on, questions. These are not trick questions, and I would ask that you please at least start your answers to them in a yes/no format:

    1) Did your company spider the 30 billion pages used in the Linkscape index?

    2) If the answer to #1 is No, then would you say that your company spidered most of those pages?

    And before you try and redefine the term, spidering a page means going out and fetching it from the server that it is hosted on.

  36. randfish Says:

    1) Well, our engineers built the crawler, built the spidering method, designed the crawl and our funds paid for the crawl, hosting and processing bandwidth/storage, so the answer is yes. The bots that crawl did not come from SEOmoz’s servers nor did they use a UA other than what’s listed on the sources page.

    The reason this is confusing is because we’re being intentional about not revealing the whole story. I know that’s frustrating, but it’s the truth, and so are all of our claims. Like I said – for competitive and quality reasons, we’re not telling the whole story here.

  37. Neyne Says:

    @rand I have a technical question. Let’s say I want to block the Linkscape from accessing my site’s information. And let’s say I am so upset about this whole thing that I am willing to block all the bots from your list, including the major SE bots, in addition to adding the meta that you provided. How does it work from now ? I have to wait until all of your data providers provide you with a dump that includes the updated version of my site ? Which they will not have since I blocked them. So you will have all of my site’s data up to the point when i blocked the bots ? Or do I need to first add the Linkscape excluding meta tag, wait for your data providers to cache that version of the site so your script can “understand” that it is being blocked and then I block the robots from your list ?

  38. Michael VanDeMar Says:

    I’m sorry, but you are now claiming that your engineers built Googlebot, Yahoo! Slurp, etc., since those are all listed as your sources? You didn’t simply buy the data available from Exalead and Gigablast, you designed the crawl engines that power them?

    The crawl is the process of getting the pages from the servers, not “filter the data we buy from you by x, y, and z criteria”, just so you know.

  39. Tummblr Says:

    SEOmoz says that the meta tag with name=seomoz can prevent Linkscape from using a particular webpage in their data. That means that SEOmoz does have a bot that retrieves webpages. The fact that they have nofollow link data also means they must download the actual contents of webpages. Or am I missing something?

    So why can’t we get the UA for SEOmoz’s bot that downloads webpages? I don’t want to add a meta tag to every single one of my webpages just to block Linkscape. I would be so much simpler just to add it to robots.txt.

  40. Michael VanDeMar Says:

    Tummblr, if they purchase the data from someone else then that would not require them to do their own spidering. As long as non-standard meta tags are included in the data that they purchase, then they could just filter it out by querying that data.

  41. Tummblr Says:

    The actual contents of 30 billion webpages are available for sale or collectable from search engine APIs? Wow. I was thinking they buy/collect 30 billion URLs, but download the contents of those webpages themselves.

  42. Michael VanDeMar Says:

    Well, see, that’s just it. Rand is now claiming that despite the fact that they list 10 known and an uncounted additional unknown sources for the data, they did indeed design the spidering engine and crawled all 30 billion pages themselves (see comments #35 and #36 above). He has yet to explain what he means by that, but that is his claim.

  43. randfish Says:

    Michael – I’m most definitely not claiming we built Google or Yahoo! – that’s ludicrous. My “claim” is exactly what I stated above, and that includes the crawl fetching portion. Are there additional direct questions you want me to answer related to that?

    Neyne – If you block all the bots, it means that we won’t be able to grab future data from you. Our index updates once monthly, so, like Google, we’d have to re-crawl, see the block and then exclude you from the next index update. With the meta tag, the same is true – you’d need to have it up when we come and visit, so that it would be pushed to our index and wouldn’t show in results. I realize that means a delay, but this is the same way it works with search engines. If you block Google, they won’t remove you until they recrawl and rebuild their index.

    Tummblr – the 30 billion pages we’ve collected are not for sale by any engine or provider I’m aware of. You can buy some crawl data from some of those providers and we may, now or in the future, use those purchasable sources to bolster our data.

  44. Jill Says:

    Just a guess, but perhaps Rand means that they have a spider that crawls all those databases he listed where he gets his data from? (If that’s the case though, I’m not sure why he doesn’t just say so, so that’s probably not it.)

  45. Michael VanDeMar Says:

    No, Rand, again, you cannot redefine terms at your convenience. I clearly stated this in my question:

    And before you try and redefine the term, spidering a page means going out and fetching it from the server that it is hosted on.

    With that caveat you answer that yes, you guys spidered 30 billion pages, just like you initially claimed to have done. Since you list as your sources for where you got your index all of the major API’s and 2 places where you can purchase the data, either your engineers built those bots that did the work (which would be what you claim when you both claim those as your sources and say that “our engineers built the crawler, built the spidering method, designed the crawl and our funds paid for the crawl, hosting and processing bandwidth/storage”), or those things you listed are not actually your sources.

    Those cannot both be true. This isn’t a word game, Rand. Downloading data that someone else compiled, data that would have been spidered even if you did not exist, is not the same as you spidering that data yourself.

    Which is it? Did you spider all 30 billion pages, or not? We’re not talking about whether or not you spidered any of the data… did you or did you not spider all of it?

  46. randfish Says:

    Yes – We spidered all 30 billion pages. The sources and UAs are listed on the sources URL.

    No – We did not build bots for the major search engines.

  47. Tummblr Says:

    I’m so confused. Simple yes or no question:
    Does SEOmoz directly download any content from my website?

  48. randfish Says:

    If you’re asking, does a bot named SEOmoz come grab data, the answer is no. If you’re asking whether bots come and grab data that goes into Linkscape, the answer is yes and the list is at the sources URL – http://www.seomoz.org/linkscape/help/sources

  49. Michael VanDeMar Says:

    Ok, Rand, look at what you are saying. In the “Yes” portion above:

    “We” = “The sources listed”
    “The sources listed” = Google, Yahoo, Exalead, Gigablast, etc., therefore
    “We” = Google, Yahoo, Exalead, Gigablast, etc.

    And Tummblr, the question really is, did SEOmoz directly download all of the content from everyones site in their index. That’s what he is saying.

  50. randfish Says:

    That’s not what I’m saying or what I’ve said. I’m saying three things:

    1)Linkscape has an index
    2)That index was built by crawlers we control and may, now or in the future, pull data from any of the sources listed.
    3)We didn’t build bots for any of the major search engines.

    I think my above answer to tummblr holds true for your modified version of the question as well. Sources listed on that page did download all the pages in our index for use in Linkscape.

  51. Hugo Says:

    Many years ago, I had an exchange with Rand about building a tool that would make the toolbar PR obsolete.

    Unfortunately, I feel that while that was likely a prime motivation for “Linkscape,” it’s missed the mark while also infuriating a significant portion of the SEO/webmaster community.

    I just hope that exchanges like this (and discussions like the ones that are occuring in Sphinn and twitter) will eventually lead to such a tool.

    And Rand…I know that you’re in a tough spot…trying to protect the value of your product by withholding certain information…I just think that in order for this product to ever truly reach it’s potential, you’ll have to be 100% forthcoming with the mechanism that powers it.

    But then again, what do I know?

  52. Marty Martin Says:

    This is really becoming a circular argument. The impasse seems to be there are some folks who want more information, and SEOmoz for their own reasons (however you choose to subjectively interpret them) are not going to tell you. They have a (I’m sure) significant investment in labor, research and marketing involved and are not going to throw out the baby with the bathwater.

  53. Tummblr Says:

    Thanks for clearing that point up. Since SEOmoz does not itself grab page content from my website, I gather that the creation of Linkscape did not add to my bandwidth costs.

    One of my worries was that aside from G/Y/M, now there is yet another entity fetching every single one of my webpages (with no benefit to me). Glad to hear that is not the case.

    Even if bandwidth is not an issue, there’s still the matter of competitive intelligence. It would be nice for webmasters to be able to submit a request to have their entire domain removed from Linkscape, without adding a meta tag to every single page. Pretty please? :)

  54. Kev Says:

    @Rand

    From what it sounds like you have enlisted the help of bots built by a third party(s), but work according to your requested specifications, and will continue to work in while you keep the company(s) that you are working with on retainer.

    Let’s peel back the layers.

    Does Linkscape utilize a first party solution for getting documents into its index? In other words, is there in-house technology at SEOMoz that actually makes contact with servers across the web to put into the Linkscape index?

    I think the answer has become a pretty clear “no”. From the statements I’ve read, and the UAs released, I can only assume that the bots that are directly in contact with the pages being downloaded are built by outside sources.

    Next, where do the bots commissioned for the Linkscape tool download their content to? Are the spiders visiting pages and downloading them directly to servers under the control of SEOMoz?

    This is something I’m not clear about. My guess is that the companies that have built the bots get the data directly to their own servers and SEOMoz gathers it from there.

    Last, are the spiders actually behaving according to SEOMoz’s specifications?

    More specifically, are the spiders working in a such a way that you have designed them to, or are you simply being selective about the information you choose to receive while they act as they usually do? What level of control do you have on the actual behavior of the bots as they crawl the web?

    And, I want to be clear, I’m not supposing that any of my assumptions are really how the tool works. I have come to those conclusions based on what I’ve read and what I feel I have been lead to believe, any clarification would be extremely welcomed.

  55. randfish Says:

    Marty – yeah, we do have some things, mentioned above, that we’re not disclosing.

    Tummblr – that’s not exactly right. The crawls that feed into SEOmoz do use bandwidth, but we don’t download plug-ins or images or scripts so it’s just the raw HTML, which significantly reduces the amount of bandwidth compared to something like Google/Yahoo/Ask/Archive/etc.

    Of course, when we backfill with these folks’ data, we’re not incurring any additional bandwidth.

    On the domain side – we’ll certainly look into a way to block an entire domain, but we’ll probably need to build a verification system for webmasters and site owners first.

  56. Mert Says:

    “That index was built by crawlers we control and may, now or in the future, pull data from any of the sources listed.” Rand all the noise here is about having a respected robots block command to the indexing crawlers (Dan Thies’ original idea he sent you over the email). Rand, I have never seen you circle around the topic before. You have to admit; some of your closest friends in this field opposed you on this issue. I think I will refuse when someone offers me VC money next time. Corporate SEO sucks. I realize this is a make or break deal for your VC agreement. Just to get to Aaron Wall’s blog post on this topic, how much longer can you survive the negative branding. Anyhow, I am out to block my websites to non search engines. It is a shame this issue came to this level. Whether or not your opponents in this issue are right or wrong even Ali did not pull this much Rope-a-dope in a boxing match as you did in Sphinn and here. I wish you and your partners the best in your ventures but it seems this will hurt you as much as it will help you. Good luck.

  57. randfish Says:

    Thanks Mert – I think that anytime we can’t have full disclosure, and this is certainly an issue where that does exist, we’re going to have misconceptions and confusion. For that, I certainly apologize. I hate to run around in circles and I feel that I’ve tried to be as direct as I can without compromising the information we need to protect.

  58. NotHappy Says:

    Rand, if you build a crawler that has indexed 30 billion pages what is the robots.txt UA or IP addresses to block “that” bot?

    For now i don’t care about your long list of random other well known bots you may or may not backfill data from.

    I want to know the UA or IP of YOUR bot scraping 30 billion webpages that you claimed to have engineered from the gound up.

    I control 23 servers hosting many thousands of sites and i want to firewall your bot from the lot.

    Thank you and bookmarked to receive a reply.

  59. randfish Says:

    NotHappy – the list of sources on the page is as far as we’re going with disclosure (primarily for competitive reasons). However, that list does include all the sources we use, and they all obey robots.txt. You can also keep your results out using the meta seomoz noindex tag.

  60. Doug Heil Says:

    Goodness; Please read my last post prior. I stand by that.

    Ridiculous.

  61. Doug Heil Says:

    Let’s see now; I’ve read all the new threads at sphinn. I’ve read the blog posts in here. I’ve read pretty much everything Rand has had to say on the matter.

    OK; Each one of my clients pays Rand $79 per month for the privilege of seeing his views on link values, etc.

    OR:

    Each one of my clients can stick up a Rand meta tag that just might keep all pages out of his Linkscrape tool. Each one of my clients also gets the benefit of having another SEO’s meta tag on each page of their site. OK fine.

    SO:

    Shouldn’t Rand pay each one of my clients an advert fee? I’d say about $79 per month should be OK.

    BTW: I have zero intention of using this meta tag… which is bogus anyway… to block pages from Linkscrape. SEOMOZ better give us a GOOD way to block his stuff via robots.txt. If you don’t do this; you can bet NO ONE worth two shites is going to consider SEOMOZ a best practices of anything in the future. As it is right now; I highly doubt you ever win over the people in this industry no matter what you do at this point, because of the shabby/shitty/sleazy/slimy way you launched your new product. I could give a rats ass any value the tool has at all as that is not the issue or the point of why I am pissed off.

  62. Marty Martin Says:

    Here’s the thing though, if you use a meta tag or an entry in robots.txt, and you’re worried about your competitors finding you or your link network, you’re still going to be hosed because a google search for-

    “robots.txt” “dotbot” filetype:txt

    will quickly reveal who’s using a particular UA exclusion. Same for meta tags. The Internet is too open to hope to hide anything unless you disallow everything or even better, stick it behind a firewall of some kind.

  63. Bruno Says:

    Great post. I think, what happened here is they drew the line too thin on ‘hype’ for their new product. I for one thought they hype was over rated….. and clearly so did many…

  64. Michael VanDeMar Says:

    Marty, the vast majority of the search you mentioned will show people talking about robots.txt, not individual robots.txt files. As a general rule those only get indexed if someone links directly to them… just like any other file on the internet.

    As for people’s reasons for not wanting their information in the index, it’s simply their right.

  65. NotHappy Says:

    Just as i thought Rand, the old “Can’t say competitive intelligence” line which puts SEOMoz at the bottom of the barrel with all the other scrapers. Actually they are better, they are harvesting it for things like spam which i can identify and do something about.

    So looks like your Dotbot (which may be a red herring anyway) has been issued a /28 so:

    # iptables -A INPUT -s 208.115.111.240/28 -j DROP

    That will block 208.115.111.240 – 208.115.111.255 at the firewall (The Dotbot site is using 208.115.111.242 if you visit in the browser)

    BTW Rand, how did you receive a /28 IP allocation? Running a scraper network is not valid ARIN justification. Might have to send off an email.

    Anyhow if everyone issues this command:

    # iptables -A INPUT -s 208.115.111.240/28 -j DROP

    Alternatively you can block that IP range with .htaccess if you don’t have root access, and i will dig in to the logs and see what this LinkScraper is doing.

    Sorry to say, all respect i had for SEOMoz is now gone and i will not be visiting it again.

  66. Doug Heil Says:

    @Nothappy; I would appreciate it if you could please write up a blog post about how “exactly” website owners and webmasters can block this bot. If not you; someone you give the info to would suffice. I think it would be very beneficial to the industry. This bot is like any o’l rogue bot not wanted on our servers. It needs to be dealt with by a good majority. What I mean is, make things very clear so as a new webmaster/owner clearly sees what to do step by step.

  67. Michael VanDeMar Says:

    Well, blocking by IP is ok as long as they don’t decide to switch or add new IP’s. The thing is, that’s one bot, and I still firmly believe that it only accounts for a small amount of the collection process they are claiming. It’s the only way that things add up.

    Also, since that’s like closing the barn door after SEOmoz kicked it in and raped the horses, here’s a possible method for dealing with what they took already:

    http://smackdown.blogsblogsblo.....-meta-tag/

  68. NotHappy Says:

    Doug (Or anyone else) feel free to write a post with this info, but as Michael points out it’s not a single bullet proof solution because they can buy more IP’s and it doesn’t stop the “other sources”. But it will block the /28 IP block that has been assigned to Dotbot.

    The /28 is a block of 16 IP addresses, in this case from:

    208.115.111.240

    to

    208.115.111.255

    So if you firewall those 2 IP’s and the 14 in between you will be blocking the Dotbot IP range. The Dotbot site itself is hosted in the http://208.115.111.242 IP.

    In the past i have found the best way to deal with situations like this is to target the data sources rather than trying to fight the offender.

    Our company is writing letters to all data sources on the LinkScrape page besides the big 3 (G, Y and MSN) notifying them we have completely blocked their crawlers from 23 webservers encompassing thousands of websites due to the data harvesting practices of SEOMoz.org who are utilizing and selling their data services.

    I doubt these companies will be impressed when the value of their services are lowered, user experience reduced etc because of some company in Seattle flogging $79 a month subscriptions with the data. Especially when they realize they have been blocked from sub 5,000 Alexa sites due to SEOMoz.

  69. Lance Says:

    Interesting enough, the project’s code name was Carhole, and the registered name for dotnetdotcom.org is Jeff Albertson. Jeff Albertson just happens to be comic book guys name…………

  70. Steve Says:

    I have also been ordered to fight this on our European servers that currently host 32,000 websites and complaints have been filed. I think this will have seriosu repurcussions.

  71. Zaphod Says:

    Thanks for the pertinent info on these goofballs. I have also found that blocking them by user agent helps too. The updated signatures for {my spammy product} will be able to deal with these keyword scrapers even better.

    I did try to block them before by domain, but they weaseled around that, the give-away was their user agent showing up from other IPs. I just wanted to make sure (and this page helped me decide) that I wanted to deep-six their crawler from wherever it came from.

    I personally would rather toss them a 403 a few times with reasons, than give them a robots.txt . Most scrapers, and bot probes (laycat.com) just ignore robots.txt anyway. After they hit {my spammy product} 3 times, it switches them to a 503 permanently. Quite better at bandwidth saving.

  72. Ryan Says:

    Sorry to resurrect an old thread, but has ANYONE figured out how to block linkscape outside of the ridiculous meta tag?

  73. Zaphod Says:

    Well pardon me, I was just trying to help. Seems a GNU General Public License author can’t get any respect anywhere.

    Please deleted this post, and the previous one I made. I surely don’t want to be associated with the likes of you!

  74. Why The Renewed Interest In The Linkscape Scams And Deception..? | Smackdown! Says:

    […] I covered this back when the launch actually happened, in this Linkscape post, resulting in quite a few comments, and there was more than a little heated conversation in the […]

  75. Yatin Says:

    @Ryan – you can block the linkscape UA via robots.txt

    User-agent: rogerbot
    Disallow: /

    ( Source: http://www.seomoz.org/dp/rogerbot )

    personally, i love the seomoz tools so I wouldn’t be blocking it.

  76. Michael VanDeMar Says:

    Actually, Yatin, that is a different tool that they launched last May. Also, based on that tool showing up in Project Honeypot, it looks like that tool isn’t obeying robots.txt either, regardless of what they say. The Linkscape tool still doesn’t identify itself as such to this day.

  77. Yatin Says:

    but Michael, as per seomoz blog rogerbot is the name of their crawler and although i haven’t checked my server logs to see visit by rogerbot, i do NOT think :

    a) seomoz would have two different crawlers

    b) Rand would risk brand equity by ignoring robots.txt like spam bots do ;)

  78. Michael VanDeMar Says:

    Regardless of what your misconceptions about Rand’s ethical practices are, Yatin, that is a completely different bot from Linkscape, and the rogerBot you are discussing has in fact wound up in Project Honeypot’s database.

Leave a Reply