If Googleplex Employees Don't Understand The Webmaster Guidelines, How Can They Expect Webmasters To Adhere To Them?

Nancy Drew, we need you! Last month, in one of my posts, I had decided to include a few images to see how well they might rank for their keyphrases. I had never targeted any of the image searches, and due to one of the topics of the post it seemed like a good opportunity to do so. When I went to check later to see if the images happened to be indexed yet, they weren’t. It had only been a couple of days, and I really didn’t expect them to be there yet, so that was really no surprise. What did surprise me, however, was that as it turns out none of my images were indexed. In fact, nothing whatsoever from that subdirectory was currently in Google’s index, Image search or otherwise.

I do know that at one point the contents of that directory, which is where I put all of my support files for posts (I use FTP instead of WordPress upload functionality), were in fact indexed… which just made the fact that they were now missing an even bigger mystery. I checked all of the usual culprits (broken links, robots.txt, .htaccess, etc.), but everything checked clean. Donna Fontenot suggested that perhaps it was something wonky with the way I was handling my subdomains, and it occurred to me that perhaps my old evil host, DreamHost, might have done something uncool to cause the directory to become deindexed. Since neither was something that lent itself to easy diagnosis, I decided to pop over to the Google Webmaster Help discussion on Google Groups and ask there if anyone saw something I might be missing.

After a few posts, Googler John Mueller chimed in and gave me a few tips for getting images indexed in Google Images, pointed me to some other posts on the topic, and suggested that I opt in to “enhanced image search” in Google Webmaster Tools. I was pretty sure that I had done most of what he suggested, but I went back through the images to make sure, tweaked a couple, read the blog posts he suggested, and added the “enhanced image search” option (which allows people to label, or tag, your images in Google Images, from what I understand) to the Smackdown profile in Google Webmaster Tools. After about a month, still no love from Google as far as indexing the images (or anything else in that directory, for that matter). In fact, from what I can tell, Googlebot has not even looked in that directory this month. So, I posted an update.

John replied with the following:

So I took your post as an excuse to take a better look at your site 🙂 and found a few things which I’d like to share with you. In particular regarding your /images/ subdirectory I noticed that there are some things which could be somewhat problematic. These are just two examples:

– You appear have copies of other people’s sites, eg /images/viewgcache-getafreelinkfromwired.htm
– You appear to have copies of search results in an indexable way, eg /images/viewgcache-bortlebotts.htm

I’m not sure why you would have content like that hosted on your site in an indexable way, perhaps it was just accidentally placed there or meant to be blocked from indexing. I trust you wouldn’t do that on purpose, right? – John Mueller

When you do a search on Google, next to the results in most cases you will see a “Cached” link. When you click on it, you see a snapshot of the page as it was in the past, when Google crawled it. Since webpages are subject to change, this cached copy can help to explain why a certain page is showing up in the search results, even if the content doesn’t match what you searched for when you visit the page on that day. Google explains the cached contents as such:

Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable.

The content in question on Smackdown was most definitely placed there on purpose. The pages are not “copies of other people’s sites”. They are cached copies of webpages or search results that relate back to the discussion at hand in each and every case. They are relevant support files to back up and demonstrate what I am talking about at the time I make my post, because I know the content I am talking about won’t be the same later.

Some of my posts discuss Google’s search results, and as such a few of them contain cached copies of searches. John hints in that post that I am in violation of Google Webmaster Guidelines by doing this:

At any rate, I’m sure you’re aware that this kind of content is not something that we would like to include in our index and which is also mentioned in our Webmaster Guidelines. – John Mueller

I went through the guidelines carefully, but for the life of me could not figure out what John was referring to. That is, until this morning, when I went through them again and realized that the only thing he could be talking about was this line:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.

Ok, fine. Here’s the problem… that guideline has nothing whatsoever to do with the cached pages I have on my site. This is a relatively new guideline, put in place in March of last year. Matt Cutts explained the change on his blog when it happened. It relates to auto-generated, mass-produced, low-quality pages that some sites would use to boost their content, taking the shotgun approach to getting something ranked. In fact, the type of content that guideline suggests blocking is exactly the type of content that Google has decided to start indexing on its own, with very little control being given to webmasters (since Googlebot now crawls HTML search forms). What it is not referring to is a few odd pages that were cached in order to facilitate discussion.
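
For reference, a literal reading of that guideline amounts to a robots.txt rule. The sketch below is only an illustration, not this site's actual configuration: the Disallow prefix is inferred from the two example paths John quoted, /images/example.png is a hypothetical image path, and is_crawlable is a helper name made up here. It uses Python's standard urllib.robotparser, which does simple prefix matching and may not interpret every rule exactly the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only the cached copies, which in the
# examples quoted above all share the "/images/viewgcache-" prefix.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/viewgcache-
"""


def is_crawlable(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # One cached copy named in John's reply, one hypothetical image file.
    for path in ("/images/viewgcache-bortlebotts.htm", "/images/example.png"):
        print(path, is_crawlable(ROBOTS_TXT, path))
```

Under these assumptions the cached copy is reported as blocked while the image in the same directory stays crawlable, so a prefix rule of this kind would separate the two without moving any files.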

I’m pretty sure that John is wrong on the reason why that entire subdirectory has gotten deindexed. It doesn’t make sense from the standpoint of the guidelines, and also would not explain why the other files within that same directory weren’t getting indexed either. Additionally, on some datacenters there are now entire posts, ones that are still linked to from the front page of my blog and have a few other decent links to them, that have been deindexed. It could of course be completely unrelated, but something does appear to be going on somewhere. I mean, I am more than happy to submit a reconsideration request, to have someone who can “look behind the curtain” have a peek, but why on Earth should I have to block relevant content from Google? The thing is, if a Google employee can misread a reference in the guidelines and think that may be the cause of the problem, what chance do normal everyday webmasters have in getting it right?

Quick PS: In case anyone at Google reads this, I thought I’d let you know… the example link for the Cached feature on this page points to a datacenter that is no longer functional, namely 216.239.53.104. Just in case someone wanted to update it to a working link. 😀

6 thoughts on “If Googleplex Employees Don’t Understand The Webmaster Guidelines, How Can They Expect Webmasters To Adhere To Them?”

Nick Wilsdon

May 27, 2008 at 9:50 am

I think we’ve jumped the shark as far as examining Google’s Guidelines go. It’s become fairly obvious over the last year that they encompass any activity which (1) Google doesn’t approve of, (2) sees as threatening to their business model or (3) has succeeded in manipulating their SERPs.

I don’t think we should get bogged down in the details anymore. Maybe just replace them with that last sentence? 😉

Michael VanDeMar

May 27, 2008 at 9:59 am

Thing is Nick, whatever the guidelines are, there needs to be some sort of logic behind them, which points 2 or 3 you mention need to fall under. General disapproval without reason isn’t a guideline, nor something that could be followed… it would boil down to whim if that were the case, which I am sure is not the intent here.

Reuben Yau

May 27, 2008 at 12:12 pm

It sounds like what you need to do is separate the cached pages and images into different subdirectories.

Interesting that G is now also able to choose specific parts of a site it wants/does not want to index.

Nadeesha Cabral

May 29, 2008 at 4:46 am

Your frustration is pretty understandable. And in this case, I’m surprised that John didn’t realize the pages in question were indeed cached pages.

Anyway, I also believe that while combating the 1,248,531 people trying to exploit Google, they have been extra defensive about a few things and because of these, the legitimate users are also placed at a disadvantage.

I guess that’s just how the world works. I hope you’ll be able to resolve the issue soon.

Cheers!

Adam Beazley

May 30, 2008 at 12:12 am

In the future, why don’t you just take a screenshot of the cached search results? This way you can accomplish what you want and Google won’t penalize you for it. It beats fighting with an algorithm, if you ask me.

Michael VanDeMar

May 30, 2008 at 8:35 am

Adam, while I have recently discovered that there is software that will take a snapshot below the fold (ie, the whole page, even if it isn’t visible on your screen at the time), there is no reason to be penalizing what I have already. This stuff isn’t even duplicate content… it is stuff that existed briefly, then didn’t anymore. It has historical value.
