Never Ending Quest to Kill Spam.
Posted on 09.12.05 by lucia @ 9:04 am

As many of you know, I was hacked a while back. Naturally, this caused me to monitor my site and make more changes to either keep out spam or reduce pointless bandwidth losses to stupid search robots. Today, I’m going to tell you what I did about robots.

First, I had previously installed a robots.txt file in my public folder. The robots.txt file exists to tell robots to keep out of certain parts of my site; you can see mine here: lucia’s robots.txt file. (Heck, you can see Google’s here. If you ask, I can explain some of the “whys” for each exclusion line. Most are in there to avoid wasting the time of innocent web surfers.)

Mind you, I want indexing robots to crawl my site. I love the fact that Google, MSN, Yahoo and all the other helpful services index my site and help knitters find my ridiculously long articles on knitting stockinette, knitting rats or anything else they might want to find. That said, from time to time, I notice the lower IQ bots seem to just index everything in a pointless and stupid way.

What do I mean by a pointless and stupid way? Here’s an example: I recently added a convenient “email this article” link at the bottom of my articles. When people click this, they go to pages with addresses similar to this:

http://www.thedietdiary.com/blog/wp-email.php?p=411

Notice the “wp-email.php?p=411” part of the address? Every single one of my blog articles has a different number; the one you are reading right now is number 413. Anyway, three particularly dumb robots were indexing every single one of those pages “just in case”! (Google stayed away. Don’t ask me how it knew to stay away. It’s normal for bots to be dumb and just crawl.)

None of these bots are malicious, so you might think I wouldn’t mind. But when they load a page to index it, that uses my bandwidth. If I keep them away from the page, that saves me bandwidth. Plus, let me assure you, the only people who want to get to those pages are visitors who click one very specific link on one specific blog article. So, I’d just as soon the bots didn’t visit those pages and index them. That way, they won’t waste my time, the bots’ time or the time of any innocent victim who might follow a link from a search engine to that particular page.

To fix this pesky stupid crawling by bots, I added this to my robots.txt file:

Disallow: /blog/wp-email.php

Since that time, the well behaved obedient search index robots have stopped loading those pages. That’s a nice thing.
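If you want to see how an obedient robot reads that rule, here is a minimal sketch using Python’s standard urllib.robotparser module. A few things in it are my assumptions rather than facts from the real file: the “User-agent: *” line (only the Disallow line is quoted above), the made-up bot name “ExampleBot”, and the “?p=413” style address for a regular article.

```python
from urllib.robotparser import RobotFileParser

# Assumed file content: only the Disallow line comes from the post.
# The "User-agent: *" line is a guess at what surrounds it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/wp-email.php
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a rule-following bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# The "email this article" page from the post.
email_page = "http://www.thedietdiary.com/blog/wp-email.php?p=411"
# Hypothetical address for an ordinary article.
article = "http://www.thedietdiary.com/blog/?p=413"

if __name__ == "__main__":
    print("email page allowed:", allowed(email_page))
    print("article allowed:", allowed(article))
```

The rule is a simple prefix match, so it covers every “?p=…” variation of the email page while leaving the articles themselves open to indexing. It only works on bots that bother to read robots.txt.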

Ok, but while I was at it, I figured out that spam bots might load the page too. Admittedly, since there are no comment boxes, the spam bots can’t do much harm if they visit, but I’d just as soon not provide access. So, I added these lines to the .htaccess file in my ./blog directory:

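# Refuse wp-email.php unless the visitor arrived from this site.
RewriteEngine On
# Both conditions must hold: the referrer is neither the www nor the bare domain.
RewriteCond %{HTTP_REFERER} !^http://www\.thedietdiary\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://thedietdiary\.com/ [NC]
# [F] answers with 403 Forbidden; "-" means the URL is not rewritten.
RewriteRule wp-email\.php$ - [F]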

The lines added to .htaccess will prevent anyone from loading those pages unless they came from my site; that keeps the disobedient spam bots out for good measure.
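For anyone who finds rewrite rules hard to read, here is a rough Python sketch of the decision those lines make. It is only a model of the logic, not how Apache itself works: the function name is_forbidden is made up, the patterns are written with every dot escaped, and I am assuming the rule sees the path relative to the blog directory (so just “wp-email.php”, without the query string).

```python
import re

# Patterns modeled on the two RewriteCond lines; [NC] becomes re.IGNORECASE.
ALLOWED_REFERERS = [
    re.compile(r"^http://www\.thedietdiary\.com/", re.IGNORECASE),
    re.compile(r"^http://thedietdiary\.com/", re.IGNORECASE),
]

# Modeled on the RewriteRule pattern (matched against the path only).
PROTECTED = re.compile(r"wp-email\.php$")


def is_forbidden(path: str, referer: str) -> bool:
    """Return True when the rules would answer 403 Forbidden."""
    if not PROTECTED.search(path):
        return False  # the rule does not apply to other pages
    # Forbidden only when the referrer matches none of the site's own addresses.
    return not any(p.search(referer) for p in ALLOWED_REFERERS)


if __name__ == "__main__":
    print(is_forbidden("wp-email.php", "http://www.thedietdiary.com/blog/"))
    print(is_forbidden("wp-email.php", "http://spam.example/"))
    print(is_forbidden("wp-email.php", ""))
```

Note the last case: a visitor who sends no referrer at all is treated the same as one arriving from somewhere else.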

Unfortunately, I couldn’t leave well enough alone. I monitored a little longer. I noticed the dumb bots and spam bots were loading “../blog/wp-comments-post.php”. That’s a program that runs when someone fills out a comment form at my blog and then clicks submit. There is no good reason why anyone should run that program unless they submitted a comment form on my blog. So, the bots just don’t need to go there. To keep both nice obedient and nasty spammy bots out, I copied the lines illustrated above, changed “wp-email\.php” to “wp-comments-post\.php” and added the respective commands to robots.txt and .htaccess.

Great, right? Naturally, almost immediately, I received an email from someone who had tried to post a comment. They typed everything in and were sent a “mysterious” message. Lucky for me, they saw my private email address. I think the problem was my “super clever” attempt to keep the bots out. Sigh…

I realized that sticking those lines in the .htaccess file will also block many people who are behind firewalls or who use privacy screens to hide their referrers. Now, I’ll admit I don’t quite know why anyone wants to hide their referrers, but people do. Since I want comments for my sake, I’m going to take the lines about blocking access to comments out of the .htaccess file. I’m leaving the no-referrer block on the wp-email.php script though. After all, people can still just cut and paste an article url into their email program, so they can still tell their friends about an article if they want to do so.
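Here is a small Python sketch of why those commenters got shut out. The function name comment_post_blocked and the allow_empty switch are made up for illustration. The switch models a variation that lets requests with no referrer through; that is not what I did (I simply removed the block for comments), and it has an obvious cost: any bot that sends no referrer gets through as well.

```python
import re
from typing import Optional

# The site's own addresses, with or without "www." (case-insensitive).
OWN_SITE = re.compile(r"^http://(www\.)?thedietdiary\.com/", re.IGNORECASE)


def comment_post_blocked(referer: Optional[str], allow_empty: bool = False) -> bool:
    """True when a referrer-based rule would reject a comment submission."""
    if not referer:
        # Firewalls and privacy tools can strip the referrer entirely,
        # so a real commenter looks exactly like a bot here.
        return not allow_empty
    return OWN_SITE.search(referer) is None


if __name__ == "__main__":
    # A commenter whose browser sends the referrer normally.
    print(comment_post_blocked("http://www.thedietdiary.com/blog/"))
    # A commenter behind a privacy screen: no referrer at all.
    print(comment_post_blocked(None))
    # The same commenter under the hypothetical "allow empty" variation.
    print(comment_post_blocked(None, allow_empty=True))
```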

Anyway, maybe this will help me save some bandwidth. I’m not sure why wasting bandwidth bugs me, but it does. It just does!


Lucia Liljegren: Copyright 2005-2007 Rights to all site content including knitting patterns, generators and haikus reserved.
