Remove Joomla PDFs from Google and Yahoo search results

Tuesday, 24 June 2008

As you probably know, Joomla has a built-in PDF generator. The problem with PDFs is that Google sometimes places them in search results (SERPs) instead of the original Joomla HTML article.

The PDFs seem to be better optimized than the HTML, probably because their keyword density is higher: they don't include the navigation and modules usually found on a Joomla HTML page.

When visitors search Google and find the PDF instead of the article, you may lose them, because they get no navigation menu, no site search, and so on. They just get annoyed waiting for Adobe's reader browser plugin to load.

The solution is simple: alter your robots.txt (found in the site root) and add these lines to prevent PDFs from being crawled and included in Google's index.

Joomla 1.0.x, with or without SEF: 

User-agent: Googlebot
Disallow: /index2.php?option=com_content&do_pdf=1*

Here are two more lines to block Yahoo's Slurp crawler from indexing Joomla-generated PDFs:

User-agent: Slurp
Disallow: /index2.php?option=com_content&do_pdf=1*

 

Joomla 1.5.x, with or without SEF:

Here we're also blocking indexing of the print version and the "mail to friend" window.

User-agent: Googlebot
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*

User-agent: Slurp
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*
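
If you maintain the same rules for several crawlers, it can help to generate the groups from one list so they never drift apart. Here is a minimal Python sketch of that idea. The function name build_robots_groups and the constant JOOMLA_15_PATTERNS are made up for this example, and the patterns are simply the four Joomla 1.5 lines listed above.

```python
def build_robots_groups(user_agents, disallow_patterns):
    """Return robots.txt text with one group of rules per user agent.

    Groups are joined with a blank line between them.
    """
    groups = []
    for agent in user_agents:
        lines = ["User-agent: " + agent]
        lines += ["Disallow: " + pattern for pattern in disallow_patterns]
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


# The four Joomla 1.5 patterns from this article (PDF, print, mail-to-friend).
JOOMLA_15_PATTERNS = [
    "/index.php?view=article*&format=pdf",
    "/index.php?view=article*&print=1*",
    "/index.php?option=com_mailto*",
    "/component/mailto/*",
]

if __name__ == "__main__":
    print(build_robots_groups(["Googlebot", "Slurp"], JOOMLA_15_PATTERNS), end="")
```

Paste the printed text into your robots.txt. The blank line between the groups follows the original robots.txt convention, where records are separated by blank lines.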

 

If you use a third-party SEF extension, you need to identify the part of the URL that appears only when a PDF or print version is requested, and enclose it in asterisks (*). The robots.txt line to add will look like this:

 

Disallow: /*/pdf_string/* 
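
Before touching the live robots.txt you can sanity-check a pattern locally. The Python sketch below is only an approximation of Google-style matching. It assumes that * matches any run of characters, that a trailing $ anchors the end of the URL, and that a rule otherwise only has to match the beginning of the URL. The helper names and the sample URLs are hypothetical, and pdf_string is the same placeholder as above.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' is a wildcard, a trailing '$' anchors the end,
    and everything else is matched literally from the start of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any non-empty Disallow pattern matches the path."""
    return any(
        rule_to_regex(pattern).match(path)
        for pattern in disallow_patterns
        if pattern  # an empty Disallow line blocks nothing
    )


if __name__ == "__main__":
    rules = ["/*/pdf_string/*"]
    # Hypothetical URLs, only for illustration.
    samples = [
        "/news/pdf_string/my-article.html",
        "/news/my-article.html",
        "/pdf_string/my-article.html",
    ]
    for url in samples:
        print(url, "blocked" if is_blocked(url, rules) else "allowed")
```

Note the last sample: because the pattern starts with /*/, a URL where the PDF marker is the very first path segment is not matched in this sketch, so check where the marker sits in your own SEF URLs.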

 

Google Webmaster Tools provides a robots.txt testing tool, so you can check your robots.txt against URLs to make sure your setup works as intended.

Google and Yahoo support wildcard matching in robots.txt, while other search engine robots may not.

This technique will yield results once Google reindexes your site.

Resources:

Google Webmaster Help Center: "I don't want to list every file that I want to block. Can I use pattern matching?"

Yahoo robots.txt guide

http://www.robotstxt.org/ 






Comments (1)
1. 08-11-2008 13:12

Good article. I see a lot of advice to simply turn off PDF and printing icons to improve SEO. Using robots.txt is really the way to go. Also thanks for the links to the Google testing tool!


Last Updated ( Thursday, 18 September 2008 )
 
