One of my favorite developments in recent years is the AJAX-driven dynamic search results filtering that you find on sites like Kayak.com and Endless.com. From a usability perspective, it’s a giant leap forward in product and results filtering that genuinely puts the user in the driver’s seat, allowing real-time filtering driven by color swatches, slider bars and other similarly intuitive controls. It’s truly “fine tuning” of results.
However, like a lot of things on the web that are cool, fun and dreamt up by innovative web developers, this AJAX-powered filtering method can wreak havoc on your site’s “search engine friendliness” and ultimately negatively impact your natural search rankings.
AJAX is not a technology that search engines understand yet. To understand how search engine crawlers experience AJAX features, you have to turn off your browser’s JavaScript and then attempt to use the site as you normally would. You’ll find that some sites simply display a message telling you that you must have JavaScript enabled in order to view the site. In web programming parlance, a site like this is said to “not fail gracefully.” Those sites that do fail gracefully still allow the user (and search engine crawlers) to navigate the site and find the information they’re looking for. If you turn off your JavaScript and try this out for yourself, you’ll notice that the user-experience is no longer fluid because the pages have to reload in order to display updated content. You’ll also notice that instead of remaining on the same URL, you are now being taken deeper and deeper into SEO hell as more and more parameters are tacked on to the URL in order to facilitate the filtering process.
Example of URL from Endless.com for Men’s Adidas Lace-up Fashion Sneakers, size 10 in white.
The URL for the same product search performed with JavaScript turned off:
OUCH! (And I should mention that I was not able to get the same set of product results with JavaScript turned off.)
It’s not just AJAX that causes these URL-based problems for search engine crawlers. Other common features that have similar undesirable side effects are:
- Session tracking with URL session IDs
- Site search utilities for searching content within a site
- URL tracking parameters for visitor analysis or affiliate tracking
- Printer-friendly page versions
Why is this so bad for SEO?
At the most basic level, it’s creating a huge duplicate content problem that just multiplies with each new option that you throw into the mix. I’m not talking about the type of cross-domain duplicate content associated with stealing another site’s content that can get your site penalized. Not all duplicate content problems are so nefarious. Duplicate content within a site is not that uncommon, especially with blogs, and doesn’t result in any sort of penalty.
The problem with duplicate content within a site is that it dilutes the strength of your site and very often creates problems with indexation. Think of a well-optimized, duplicate content-free site like a well-brewed beer. Adding duplicate content is like adding water—if you add enough you lose the taste of the beer. From a more technical perspective, what’s happening is that the search engines are finding multiple pages that are optimized for a given keyword, each with a different number of external links pointing to it. So, instead of having one strong page the search engines would favor, a site ends up with many weaker pages that have much less potential to rank.
There are other undesirable side effects from this duplicate content problem. One is that the wrong page from your site might end up ranking. A perfect example of this is the “printer-friendly” version of a page, which is duplicate content of the “regular” page, typically has no navigation and might lack the formatting and call to action of the regular page. You would not want a visitor landing on the printer-friendly version of the page directly from the search results.
Another side effect is decreased crawl efficiency. Ideally, you want the search engine crawlers to come back to your site regularly and download fresh versions of every page. By drastically increasing your site’s page count with superfluous pages, you just create more work for the search engine crawlers and reduce the chances that they are crawling the most important pages regularly.
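To get a feel for how quickly filter parameters inflate a site’s page count, here is a small back-of-the-envelope calculation in Python. The filter names and the number of values per filter are made up for illustration, and the sketch assumes the parameters always appear in the same order; if your site emits them in varying order, the real number is higher.

```python
from math import prod

def count_filter_urls(options_per_filter):
    """Count the distinct URLs one listing page can spawn.

    Every filter is optional, so each one contributes
    (number of values + 1) choices; the extra choice is "not applied".
    The result includes the unfiltered base URL itself.
    """
    return prod(n + 1 for n in options_per_filter.values())

# Hypothetical filters for a single product listing page.
filters = {"color": 8, "size": 12, "brand": 20}
print(count_filter_urls(filters))  # 9 * 13 * 21 = 2457
```

Three modest filters turn one page into a couple of thousand crawlable URLs, all showing subsets of the same products.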
How to Deal with Duplicate Content
There are numerous ways to deal with duplicate content. I’m going to divide them into two categories and detail the best time to use each of the methods.
Category One: Search Engine Tools (Canonicalization Approaches)
The first category of methods for dealing with duplicate content on your site is tools and conventions that the search engines provide. All of these solutions refer to a process called “canonicalization,” which, by definition, is a process for converting data that has more than one possible representation (i.e. your duplicate content) into a "standard" canonical representation (i.e. a single version). All of the methods in this section are prescribed by the search engines (predominantly Google) for dealing with duplicate content.
www. vs non-www URLs – this is largely a problem of the past, but search engines used to treat http://www.yourdomain.com/somepage.html and http://yourdomain.com/somepage.html as different pages because technically they are different fully-qualified URLs. Google provides a way of setting your preferred domain in Google Webmaster Tools. Log in to your Webmaster Tools account and look under “Site Configuration > Settings” and you’ll see a section for “Preferred Domain.” Simply choose the presentation you prefer.
Since Google is the only search engine to provide a setting for this in their webmaster control panel, I also always recommend setting up a global 301 redirect to your preferred domain. The methods for doing this are different depending on your hosting platform and the solutions can be easily found with a simple search.
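The logic behind such a global redirect is the same on any platform, so here it is sketched in Python rather than in any one server’s configuration syntax. This is only an illustration, assuming the www form is the preferred one; the constant and function names are my own, not part of any server or framework.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.yourdomain.com"  # assumption: the www form is preferred
ALTERNATE_HOSTS = {"yourdomain.com"}   # hosts that should redirect to it

def preferred_domain_redirect(url):
    """Return (301, target_url) if the URL uses a non-preferred host, else None."""
    parts = urlsplit(url)
    if parts.hostname in ALTERNATE_HOSTS:
        # Swap only the host; path, query string and fragment are kept as-is.
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
        )
        return 301, target
    return None

print(preferred_domain_redirect("http://yourdomain.com/somepage.html"))
print(preferred_domain_redirect("http://www.yourdomain.com/somepage.html"))
```

The important detail is that every path on the non-preferred host maps to the same path on the preferred host, rather than everything being dumped on the home page.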
Meta "Canonical" tag – this is a convention that all 3 of the major search engines announced they were supporting early in 2009. Think of this as a "suggestion" of a 301 redirect. Despite the common name, it is technically a link element rather than a meta tag; it goes in the <head> portion of your webpage and takes this form:
<link rel="canonical" href="http://www.yourdomain.com/somepage.html" />
This is a great solution for those nasty filtering URLs where the filter parameters are continuously tacked on to the end of the URL. You should have a clean, base URL that represents the unfiltered product or service, such as http://www.yourdomain.com/theproduct.html. On each of the subsequent filter pages, as you filter the available product options by color, size, features, etc. use the canonical URL tag to designate the base product URL and put all of your optimization and link building efforts into that URL.
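If your pages are generated by code, the tag can be built from the requested URL. Here is a minimal Python sketch, assuming that every query string parameter on the page is a filter and that the base URL is simply the URL with the query string removed. If some of your parameters select genuinely different content, they would need to be kept.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

def canonical_link_tag(requested_url):
    """Build a canonical tag pointing at the URL without its query string."""
    parts = urlsplit(requested_url)
    # Drop the query string and fragment; keep scheme, host and path.
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return '<link rel="canonical" href="%s" />' % escape(base_url, quote=True)

print(canonical_link_tag("http://www.yourdomain.com/theproduct.html?color=blue&size=5"))
```

Every filtered variation of the page then emits the same tag, pointing at http://www.yourdomain.com/theproduct.html.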
The search engines have stated that they may not always pay attention to this tag, and I suspect that the biggest factor that would cause them to ignore the tag is a significant number of internal or external links pointing to one of the nasty filter URLs or substantially different content between the pages (especially different title tags), so be aware of how people are linking to your pages and how your filtering system alters key on-page elements.
For great coverage of this topic see Rand's post at: http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps
Dynamic Parameter Handling – Google just recently announced this feature addition to their Webmaster Tools, joining Yahoo!’s Site Explorer which has supported dynamic parameter handling for some time.
This webmaster console feature allows users to designate which URL parameters the search engines should ignore. A URL parameter is any variable=value pair. Again, this is a great feature for dealing with those ugly URLs that are built using variables, such as http://www.yourdomain.com/theproduct.html?color=blue&size=5.
Another URL parameter that is notorious for creating duplicate content is the sessionID. If any of the URLs on your site have “sessionid” in them, this dynamic parameter filtering is a must.
You can specify up to 15 parameters to ignore, so this feature is limited. Also, like the canonical URL tag, the search engines look at your settings for this feature as a recommendation, not a directive.
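To picture what ignoring a parameter means, here is a short Python sketch that collapses URLs the way you are asking the search engines to. It illustrates the effect only, not how any search engine actually implements it, and the list of ignored parameters is an assumption for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change the page content.
IGNORED_PARAMS = {"sessionid", "color", "size"}

def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with every ignored variable=value pair removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

print(strip_ignored_params("http://www.yourdomain.com/theproduct.html?color=blue&size=5"))
print(strip_ignored_params("http://www.yourdomain.com/theproduct.html?sessionid=abc123&page=2"))
```

Running your own crawl data through a function like this is also a quick way to see how many of your URLs are really the same page.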
See Google’s recent post about the addition of this feature: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=147959
XML Sitemaps – XML sitemaps allow webmasters to suggest URLs to the search engines. Like the canonical URL tag and the dynamic parameter filtering, your sitemaps will not become law with the search engines. The search engines’ primary means of understanding your site and discovering new pages is the actual crawl process, because this most closely matches how an actual user experiences your site.
You can read more about using sitemaps here: http://www.google.com/support/webmasters/bin/topic.py?topic=8476
Best Solution: Use Them All – Since each of these methods for dealing with duplicate content on your site is looked at by the search engines as a “suggestion,” you should use as many of them in combination as you can. There’s no reason not to set your preferred domain (www vs. non-www) in Google because it takes less than a minute. Coding the canonical URL tag into your site’s content management system could be a monumental task, so that might be something to put on the roadmap for future site developments, but the dynamic URL parameter filtering through the webmaster consoles at Google and Yahoo! is another pretty easy feature to set up. Setting up XML sitemaps might also be a monumental task, especially since software and web-based services designed to automate this process rely on crawling technology similar to the search engine crawlers to map the site and create the sitemap. This method of sitemap generation would capture the ugly parameter URLs as well and could therefore require a lot of work to clean them up before submitting to the search engines. Many content management systems come with XML sitemap generation features or have add-on modules that accomplish the task. These will likely require some configuring to make sure that only the clean, base URLs are included. If the process must be done manually, an easy solution is to start with the most important pages of your site like the main category pages, the best selling products, etc.
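One way to keep a crawler-generated URL list from polluting your sitemap is to filter it before writing the XML. Here is a minimal Python sketch, assuming that any URL carrying a query string is a filter or tracking variation that should be left out. The input list is invented for illustration, and a real sitemap file would also start with an XML declaration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(crawled_urls):
    """Return sitemap XML containing only URLs that carry no query string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url in crawled_urls:
        if urlsplit(url).query or url in seen:
            continue  # skip parameter URLs and repeats
        seen.add(url)
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical output of a crawl-based sitemap generator.
crawled = [
    "http://www.yourdomain.com/widget-x.php",
    "http://www.yourdomain.com/widget-x.php?color=blue",
    "http://www.yourdomain.com/widget-x.php?color=blue&size=5",
]
print(build_sitemap(crawled))
```

Only the clean widget-x.php URL survives, which is the one you want the search engines to treat as the page.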
Category Two: Server and Coding Methods (Non-Canonicalization Approaches)
The tactics in this section for dealing with duplicate content do not rely on canonicalization, but rather provide methods for guiding search engine crawlers, in essence, taking them by the hand and leading them through the maze of your site. One of my biggest problems with the canonicalization prescriptions provided by the search engines is that they are only suggestions and beyond that, the search engines don’t provide any additional color around why they listen to these suggestions sometimes and not others. I like more direct approaches that I know are being respected. Search engine crawlers have a beautiful feature built in to them: subservience. There are many different ways to tame a bot.
Meta Robots Tag—“noindex,follow” – the meta robots tag is a tag created specifically for providing instructions to the search engine crawlers at the page level. Of the different commands that can be given to a crawler using this tag, the one most pertinent to this discussion on duplicate content is “noindex, follow” which takes this form:
<meta name="robots" content="noindex,follow" />
This command tells the search engines “don’t add this page to your index, but please, follow all of the links on the page.” This can be a great solution for those filter pages with the nasty URLs that don’t introduce any new content with each additional filter parameter. It’s also well suited for the search results pages from your site’s search feature where the results pages themselves don’t provide any value and are constantly changing, but the links on the page are important to follow.
This tactic prevents duplicate content from getting included in the search engines’ index while still leaving an open door for the crawlers to freely crawl the site. Keep in mind that a crawler has to download a page in order to read its meta robots tag, so this tag on its own does not reduce the number of pages being crawled.
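In a template-driven site, the decision to emit the tag can be made from the URL being served. Here is a rough Python sketch, assuming that filter pages are recognizable by having a query string and that site search results live under a /search path. Both are assumptions about a hypothetical site, not a general rule.

```python
from urllib.parse import urlsplit

NOINDEX_PATH_PREFIXES = ("/search",)  # hypothetical location of site search results

def robots_meta_tag(url):
    """Return the noindex,follow tag for filter and search pages, else an empty string."""
    parts = urlsplit(url)
    if parts.query or parts.path.startswith(NOINDEX_PATH_PREFIXES):
        return '<meta name="robots" content="noindex,follow" />'
    # Regular pages get no tag; crawlers index and follow by default.
    return ""

print(robots_meta_tag("http://www.yourdomain.com/theproduct.html?color=blue&size=5"))
print(repr(robots_meta_tag("http://www.yourdomain.com/theproduct.html")))
```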
301 Redirects – There are times when you come across duplicate content on your site that just doesn’t make sense - a page on the main site duplicated in an archive or a page left over from a redesign that got included somewhere else. Whatever the reason, when you find duplicate content that just plain isn’t necessary, you should use a server 301 redirect from one of the pages to the other. These pages may be splitting link juice (there could be links from external sites pointing to both pages, dividing the value each page should be getting) or giving the search engines confusing signals. You should look at analytics or your server log files to see which of these pages is getting more referral traffic, because that is likely the page with the most backlinks and the one that should remain active. Sometimes the layout or design of the page will dictate which should stay regardless of link data. Either way, the 301 redirect will help consolidate the value of these duplicate pages into one page.
Robots.txt File – The robots.txt file, found at http://www.yourdomain.com/robots.txt, is the first thing that any crawler is supposed to check when accessing your site. This is your first line of defense against the crawlers and can be used to provide directives about which pages or directories to avoid crawling. If your site is set up properly, the robots.txt file can be a good option for preventing the search engine crawlers from crawling and indexing endless URLs built from filter parameters or your site’s search results pages. The way to configure your site to take advantage of this approach is to have the filter feature and site search functionality tied to one directory or page and use that directory/page only for the filtering or search results content (i.e. don’t put any unique or important content in there that you want crawled).
For instance, your main product page for widget X might be http://www.yourdomain.com/widget-x.php. From this page, visitors might want a blue widget X with a wooden handle and pink fur. Choosing to filter this, or any product, should send the user to a special filter page or directory, such as http://www.yourdomain.com/filter/widget-x.php?color=blue&handle=wooden&fur=pink.
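The “keep the page with the most referral traffic” rule is easy to automate once you have the numbers out of your analytics. Here is a small Python sketch with invented paths and visit counts. It only builds the redirect map; turning that map into actual 301 rules depends on your server.

```python
def build_redirect_map(duplicate_groups, referral_visits):
    """Map each duplicate path to the path in its group with the most referral visits."""
    redirects = {}
    for group in duplicate_groups:
        # The page with the most referral traffic stays live.
        keeper = max(group, key=lambda path: referral_visits.get(path, 0))
        for path in group:
            if path != keeper:
                redirects[path] = keeper
    return redirects

# Hypothetical duplicate pages and referral counts pulled from analytics.
groups = [["/widgets.html", "/archive/widgets.html"]]
visits = {"/widgets.html": 120, "/archive/widgets.html": 15}
print(build_redirect_map(groups, visits))
```

If layout or design should override the traffic numbers for a particular pair, adjust that entry by hand before generating the redirects.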
In your robots.txt file, you would include a directive to disallow the /filter/ directory:
User-Agent: *
Disallow: /filter/
With this directive in place in your robots.txt file you can go crazy with your filtering as long as it all happens in the /filter/ directory. The search engines will avoid crawling this directory, which keeps them focused on the important, optimized pages of your site.
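Before relying on a rule like this, it is worth checking that it blocks what you expect and nothing more. Python’s standard library includes a robots.txt parser that can be used for a quick test. The sketch below feeds it the two lines above and checks the example URLs; the helper function name is my own.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /filter/
"""

# Parse the rules from a string instead of fetching them from a live site.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(is_crawlable("http://www.yourdomain.com/widget-x.php"))
print(is_crawlable("http://www.yourdomain.com/filter/widget-x.php?color=blue&handle=wooden&fur=pink"))
```

The main product page should come back as crawlable and the filter URL as blocked. The standard library parser may not match every search engine’s interpretation in edge cases, so treat it as a sanity check rather than a guarantee.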
Be aware when implementing this method that you will get “errors” in your Google Webmaster Console report for “Pages Blocked by robots.txt.” I’ve never agreed with Google including this report under the “Crawl Errors” section because the robots.txt file is a legitimate means for controlling spider access to your site. The purpose of that report in GWT is to point out the pages that are being blocked in case you weren’t aware. You are likely to get a rather large number of these “errors” in GWT if you have had the pages live and crawlable for a while and are now blocking them after the fact. This is because Google has them in their index and will try accessing them directly versus finding them through a crawl. If you launch the pages and have them blocked from the start, you should experience relatively few error messages.
URL Rewriting – This option definitely requires some significant technical and programming know-how and should only be considered by advanced developers. URL rewriting is a server-side method for “prettying up” ugly URLs. Server-side means that it is performed by some method on the server and thus happens before the page is sent to the user or search engine crawler.
URL rewriting mostly involves writing rules that your server uses to interpret and translate URL parameters. For instance, you might have a rule that says “turn all instances of color=blue into /blue/.” It’s kind of like the URL parameter handling feature the search engines offer, but instead of telling the search engines to ignore the parameters, it tells the server to translate them into something more friendly and readable.
Personally, I find this approach to be MUCH easier on servers using Apache, which relies on the .htaccess file for holding the translation commands. On servers using Windows running IIS, I’ve found ISAPI Rewrite to be a decent alternative, but there is very little support for the product and its coding conventions in the wider developer community.
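Whatever the platform, a rewrite rule boils down to a pattern and a translation. The Python sketch below shows the idea for one hypothetical rule, where a friendly path like /widget-x/blue/ is served internally by widget-x.php?color=blue. The URL scheme and pattern are assumptions for illustration; real rules live in your server’s own rewrite syntax.

```python
import re

# Hypothetical rule: /<product>/<color>/ is served by /<product>.php?color=<color>
PRETTY_URL = re.compile(r"^/(?P<product>[a-z0-9-]+)/(?P<color>[a-z]+)/$")

def rewrite(path):
    """Translate a friendly path into the internal parameter URL, if the rule matches."""
    match = PRETTY_URL.match(path)
    if match is None:
        return path  # not a friendly product URL; serve it unchanged
    return "/{product}.php?color={color}".format(**match.groupdict())

print(rewrite("/widget-x/blue/"))
print(rewrite("/about.html"))
```

Visitors and crawlers only ever see the friendly path; the parameter version stays behind the scenes on the server.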
Conclusion
See how annoying and unproductive duplicate content can be. Keep in mind that we’ve just been looking at the problem of duplicate content on your site and not duplicate content from other people copying your site. Duplicate content within your own site can dilute the value of your content and hinder search engine rankings, not to mention that it can make for a horrible user experience.
This post laid out a number of different methods for handling the problem. Some were methods provided or supported by the search engines, while others were webmaster solutions with roots in server and bot management from a time long before Webmaster Tools were available from the search engines.
My personal opinion is that the methods provided by the search engines through webmaster consoles are more for the lay-people who don’t have system administrators or hardcore coders on their staff. When possible, I prefer the server and coding solutions, mainly because I’m old school and consider these solutions to be more definite, but that doesn’t mean I wouldn’t use or advocate for any of the solutions outlined above. Each has its own “right time” to be used.
Jason Cooper
Director, Value Added Services

Hi Jason,
Thank you for the great post.
Do you recommend using Meta Robots Tag for landing pages specifically designed for SEM that may or may not have duplicate content?
Also, do you know what type of crawler Google uses to determine relevancy of landing pages in AdWords? Adding Meta Robot Tag should not cause any conflict, correct?
Thank you.
Petra
Posted by: Petra | October 15, 2009 at 10:20 AM
Petra, great question.
If you're talking about a landing page that is dedicated to paid search (meaning you can only access that page through a paid ad) then there is no need for any special tags on that page, whether it has duplicate content from your main site or not.
If the landing page can be found on your site by clicking through a series of links AND you feel that the page has duplicate content on it, then you could use the meta canonical tag and refer to the original page of content from your site.
Google uses a special crawler to crawl the landing pages it finds in customers' AdWords accounts (the destination URLs). This bot will show up in your server logs as adsbot-google. It's important to keep in mind that this crawler is a robot (bot) and will therefore obey any robot commands such as those in the robots.txt file or meta robots tag, so be careful to NOT block adsbot-google from crawling your landing pages because it will cause your Quality Scores to plummet.
I hope that answers your questions.
Posted by: Jason Cooper | October 19, 2009 at 08:50 AM
I didn't realise AJAX isn't understood by search engines, but then again, bearing in mind that JavaScript is a main component of AJAX, it does make sense.
I have dupe content issues, I think, with my new site End Gynecomastia, where index.php, www and the non-www have all been indexed as 3 different pages.
What do you think I should do? There are so many ways round it, but I don't want to do trial and error in case I go badly wrong.
BTW, informative post.
Posted by: End Gynecomastia | November 29, 2009 at 04:26 AM
You're on an Apache server, so the best thing you can do is set up 301 redirect rules in your .htaccess file that redirect all non-www URLs to their www equivalent and redirect the index.php file to the root domain. SEOMoz has a Web Developer's Cheat Sheet with a simple explanation of how to do this: http://www.seomoz.org/blog/the-web-developers-seo-cheat-sheet
Posted by: Jason Cooper | November 30, 2009 at 10:08 AM
Wow, this is a great resource for local marketing which is clearly the direction the search engines are and have been heading. Thanks again!
Posted by: senthiledp | June 21, 2010 at 08:22 PM