Web Robots: The Good, The Bad And The Ugly
This page is being posted on this web site as a result of my disgust with the spy and rogue robots on the internet. I have banned about 50 robots from crawling this site per my robots text file. Some robots honor the robots text exclusion protocol. Some do not. It would be splendid if robots that do not honor the robots exclusion protocol were banned from the internet. The only other way to keep unwanted web robots off your web site is to edit your .htaccess file. Incorrectly editing a .htaccess file can make it impossible for anyone to access your web site, including yourself. I have no intention of messing with my .htaccess file again. It is impossible to ban all of the spy and rogue internet web robots, including the massdownloader programs, from accessing your site. There are many rogue robots, spy robots and massdownloader programs in existence that webmasters may be unaware of, and more will appear in the future.
Listed below are explanations of internet plagues excluding viruses, worms, trojan horses, identity theft and adware/spyware:
1. Massdownloader Programs - Programs that are used to download entire web sites to a computer hard drive. These types of programs are bandwidth wasters that no webmaster I know of is being paid to live with. It's unnecessary for a webmaster to put his or her web site online for the purpose of being downloaded for offline viewing. One could simply create a PDF version of their web site, or a Microsoft Word document, etc. and upload those files to download.com or to freebie web sites that offer free downloads. This method would certainly eliminate web hosting fees and wasted bandwidth for webmasters. Someone recently landed on this web page after doing a search on Google for massdownloader programs. I won't list the names or hyperlinks to any such programs on this web site. I don't need anyone here using them while visiting this site. I don't think any webmaster will appreciate having their bandwidth wasted by individuals downloading web sites of 10 to 50 plus megabytes. On the other hand, if fees were paid to webmasters for such activity we would all welcome it.
2. Rogue Robots - Robots that crawl web sites for the sole purpose of downloading files off of a web server. These resource hogs do not send human visitors to a web site. They simply exist to grab your web site files off your web server. They offer no compensation to webmasters either. Webmasters can probably thank the web site content thieves (on and off line) for sending these web robots to our web sites.
3. Spy Robots - Robots that are used to crawl web sites to retrieve information to send back to a database somewhere on or offline. Some are sent to crawl web sites to look for link exchange partners. Either way, they do not send human visitors to a web site. I'm not being compensated to put up with them either. I guess you could put the email harvesting robots in this class as well. I recently banned one of these from an internet egroup. I don't send spam emails and appreciate not getting any. None of us need email harvesting robots wasting our bandwidth to gather email addresses from our web sites for spammers to send spam emails to.
4. Ugly Robots - Web robots that do not honor the robots text exclusion protocol. They crawl your web site against your wishes. They even crawl files and folders that you don't want robots to crawl or index. My guess is that someone is afraid they will miss out on some web site spying and downloading opportunities. A new sheriff is in town on this web site prior to the update to this article. That sheriff's name is "Knowledge". You can host a "decent" web site without putting up with "unwanted" robots running rampant on your web server and having a field day with your web site(s).
5. Good Robots - Web robots that send human visitors to your web site such as those sent from search engines. I have no complaints about these types of robots. I had assumed that sending visitors to a web site was the purpose (and only purpose) for the existence of web robots to begin with.
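The difference between a good robot and an ugly one comes down to whether it checks the robots text file before crawling. Here is a minimal sketch, in Python, of the check a well-behaved robot would make. The robot names, the example.com address and the rules themselves are hypothetical and are not taken from this site's robots text file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots text file: "ExampleRogueBot" is banned site-wide,
# every other robot may crawl everything.
ROBOTS_TXT = """\
User-agent: ExampleRogueBot
Disallow: /

User-agent: *
Disallow:
"""


def may_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/index.html"
    for agent in ("ExampleRogueBot", "ExampleSearchBot"):
        verdict = "allowed" if may_crawl(agent, page) else "banned"
        print(agent, verdict)
```

Nothing forces a robot to run a check like this. The robots text file is only a request, which is why the ugly robots can ignore it.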
SOLUTIONS:
1. Properly edit your .htaccess file to ban plague robots or certain IP addresses from accessing your web site. This will keep the undesirables away from your site. Visit Google to locate more information on implementing this technique.
2. Place web files and folders in a password protected directory on your web server to keep them safe from the snoops, spies and downloaders. You should be able to get information from your web hosting company on this subject.
3. Don't upload anything to a public web server that you don't want indexed by the search engines, crawled, downloaded or viewed by the public.
4. Create a members-only web site or transform your current web site into one. That would eliminate a lot of on-line headaches for most webmasters, especially for non-commercial web sites. There is no need to buy ebooks or software to accomplish this task. A members-only web site can be easily set up by using the technology available from a reliable web hosting service.
5. Block access to your web site(s) from non-search engine robots. Most if not all of the non-search engine robots that have been sent to crawl this web site are rogues, spies and bandwidth wasters. Access to your site can also be blocked from web sites you don't want linking to your web site for whatever reason. I don't think anyone in their right mind will leave a hyperlink to another site on their site after discovering that their visitors can't access your site from theirs.
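Solutions 1 and 5 can be sketched as a small Python helper that writes the blocking rules for you, so the directives are spelled the same way every time. This is only an illustration: it assumes an Apache server with mod_setenvif and the older Order/Allow/Deny directives (Apache 2.4 replaces these with Require), and the robot name and referrer host are hypothetical. Try the output on a test copy first, since a bad .htaccess file can lock everyone out.

```python
import re


def build_block_rules(bad_agents, bad_referrers):
    """Build .htaccess text that answers 403 to the listed robots and referrers.

    Assumes Apache with mod_setenvif and 2.2-style access directives.
    """
    lines = []
    for agent in bad_agents:
        # re.escape keeps characters such as "." from acting as wildcards
        lines.append(f'SetEnvIfNoCase User-Agent "{re.escape(agent)}" unwanted')
    for host in bad_referrers:
        lines.append(f'SetEnvIfNoCase Referer "{re.escape(host)}" unwanted')
    # Allow everyone first, then deny any request tagged "unwanted" above
    lines += ["Order Allow,Deny", "Allow from all", "Deny from env=unwanted"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical robot name and referring site
    print(build_block_rules(["ExampleRogueBot"], ["badreferrer.example"]))
```

A robot that lies about its name will slip past a rule like this, so it helps most against robots that identify themselves honestly but ignore the robots text file.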
Here are the main reasons visitors from certain web sites are denied access to this web site:
I have visited web sites and have not seen a hyperlink (text link) or a graphic link to this web site. I find it strange that these web sites have shown up as a supposed "referrer" in my web stats. They were supposedly sending visitors to the Easylisteninganne web site. Visitors from these sites will get a 403 (forbidden) access message on my server when they attempt to visit my site from these "so-called" referrer hyperlinks or graphic links. Someone is probably playing the framed web page game in a case like this. Some folks are too sorry to build and optimize their own web pages. It's simply easier to build a framed web page redirect and display someone else's content on their web site as their own. I will get the last laugh when visitors from these smart aleck webmaster sites along with the search engines follow these types of links to my web site. The search engine robots will know that visitors from such-and-such a site have been blocked from accessing this site if they are served the same 403 response; they cannot read my .htaccess file itself, since a web server normally does not hand that file out.
Even ebay.com has shown up recently in my web stats. I'm not speaking ill of the ebay web site or their services. The ISP that was supposedly sending traffic to this site from ebay.com was recently banned from accessing this site. The IP address of that host and referrer connecting with this site was not an ebay IP address. That Internet Service Provider wasn't one of the "reputable" ISPs that I'm aware of. There is no logical reason for ebay to hyperlink with the Easylisteninganne web site at this moment. I mention this to show how unethical some individuals can be in the internet world. Again, this was probably a site-mirroring or a framed web page redirect attempt by someone.
Here is a simple definition of site-mirroring for those who may not understand this terminology: Creating an exact duplicate of the contents of another web site and uploading those files to a web server. These clowns don't seem to understand that replicated web sites will get blacklisted and banned from the search engine listings. These people are not hurting the search engine traffic of the web sites they are framing or duplicating. Their web sites and web pages will lose search engine traffic! I won't go into the details of web page framing. Let's just say that I know how it's done and can easily implement it myself. The only "legitimate" use of such a technique would be for promoting affiliate programs. Be sure to get that webmaster's permission before doing so. I wouldn't want to embarrass myself by framing someone's web page(s) without their permission. A smart webmaster will eventually catch on to you and block access from your site to theirs. I don't think the search engines will look too kindly on such practices either. There are also copyright issues to take into consideration.
There were two directories on my web site that I had placed in my robots text file that I didn't need to have indexed or crawled. Those folders have since been renamed. Don't look for them. Those file names have also been removed from my robots text file. I later learned that it's unwise to list files or directories you don't want crawled or indexed in a robots text file. Anyone can read a robots text file, so some snoop will go to your web server and download the entire contents of that directory! I won't discuss how it can be done.
Be sure to turn off the indexes of the directories that you have uploaded to your web server. You can accomplish this with a good web host such as ezwebhosting. This will keep the snoops from viewing the contents of your file directories. Each of my file directories was renamed to make this move fool-proof. This procedure will save your bandwidth for better uses. Snoops can't download entire directories from your web server if they don't know the names of those directories. I suggest that you give these directories "hard to guess" names. Don't bother placing the names of any file directories in your robots text file.
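Both steps above can be sketched in a few lines of Python. This is a rough illustration that assumes an Apache server, where "Options -Indexes" is the directive that turns off directory listings; the function names and the "images" prefix are made up for the example.

```python
import secrets


def hard_to_guess_name(prefix: str = "files", nbytes: int = 8) -> str:
    """Return a directory name ending in a random hexadecimal suffix."""
    return f"{prefix}-{secrets.token_hex(nbytes)}"


def with_indexes_off(htaccess_text: str) -> str:
    """Return htaccess_text with 'Options -Indexes' added if it is missing."""
    directive = "Options -Indexes"
    existing = [line.strip() for line in htaccess_text.splitlines()]
    if directive in existing:
        return htaccess_text  # already switched off, leave the text alone
    if htaccess_text and not htaccess_text.endswith("\n"):
        htaccess_text += "\n"
    return htaccess_text + directive + "\n"


if __name__ == "__main__":
    print(hard_to_guess_name("images"))  # a different name on every run
    print(with_indexes_off(""), end="")
```

A hard-to-guess name only hides a directory. Anything that must stay private still belongs in a password protected directory.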
There have also been snoops viewing my .text files in the past. Text files are needed along with the html files and image files to display the images on web pages. I solved the snoop viewing of my text files by renaming most of them. The text files I didn't rename have no more than 2 or 3 images listed on them. Only the search engine robots need access to these web files anyway. The robots text file for this web site contains a list of unwelcome spy and rogue robots, and user programs that are not needed or welcome here.
Be sure to keep an eye on your visitor stats. The latest visitors panel is usually the first stop I make when I access my web server. That is the area where I have discovered information about unneeded spiders, crawlers, robots, snoops and unethical webmasters accessing this site. Write down the IP addresses of these bandwidth wasters and content thieves. You should be able to get the IP addresses of visitors to your site through your web hosting account. Then visit arin.net. If the IP address does not belong to the arin network, it may belong to apnic.net, ripe.net, lacnic.net or afrinic.net. You will get this information after visiting arin.net and typing in the IP addresses you are investigating.
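If your web host lets you download the raw access log, the same detective work can be done with a short script instead of a notepad. The sketch below assumes the log is in the common Apache "combined" format, which your host may or may not use, and the sample lines, IP addresses and robot name are invented for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [date] "request" status bytes "referrer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def bandwidth_by_visitor(lines):
    """Return a Counter mapping (ip, user agent) to total bytes sent."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        size = match["size"]
        totals[(match["ip"], match["agent"])] += 0 if size == "-" else int(size)
    return totals


if __name__ == "__main__":
    # Invented sample lines using documentation-only IP addresses
    sample = [
        '192.0.2.10 - - [13/Feb/2006:10:00:00 +0000] "GET /a.html HTTP/1.1" 200 5120 "-" "ExampleRogueBot/1.0"',
        '192.0.2.10 - - [13/Feb/2006:10:00:01 +0000] "GET /b.html HTTP/1.1" 200 2048 "-" "ExampleRogueBot/1.0"',
        '198.51.100.7 - - [13/Feb/2006:10:05:00 +0000] "GET / HTTP/1.1" 200 1024 "http://www.example.com/" "Mozilla/5.0"',
    ]
    for (ip, agent), total in bandwidth_by_visitor(sample).most_common():
        print(ip, agent, total)
```

The IP addresses at the top of the list are the ones worth typing into arin.net.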
From here you can decide whether or not you want future visits from IP addresses you have investigated on line. I recently discovered an easier, error-proof way of editing a .htaccess file. You can actually edit the .htaccess file without touching it. The procedure is called IP banning or IP deny. You accomplish this by blocking access to your site from visitors attempting to visit your site from certain IP addresses. Using this technique has kept me from messing up my .htaccess file to the point where no one could access my site. Technology is also available where you can get the IP address of a "problem" web site linking to your site. I've mentioned that type of web site above.
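For anyone curious what an IP deny amounts to, here is a rough Python sketch. It assumes, and this is only an assumption about how such tools work, that the rule ends up as an Apache 2.2-style "Deny from" line in .htaccess. The addresses are reserved documentation examples, not real offenders. Checking each address before a rule is written is what keeps a typo from breaking the file.

```python
import ipaddress


def deny_lines(addresses):
    """Return 'Deny from' lines for valid IP addresses or CIDR ranges.

    Raises ValueError on a malformed address, so a typo is caught
    before it can reach the .htaccess file.
    """
    lines = []
    for text in addresses:
        network = ipaddress.ip_network(text.strip(), strict=False)
        if network.num_addresses == 1:
            target = str(network.network_address)  # a single address
        else:
            target = network.with_prefixlen  # a whole range, e.g. x.x.x.0/24
        lines.append(f"Deny from {target}")
    return lines


if __name__ == "__main__":
    for line in deny_lines(["192.0.2.10", "198.51.100.7/24"]):
        print(line)
```

Blocking a whole range is a blunt tool, since legitimate visitors who share that range with a rogue robot are shut out as well.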
Truly, I am not concerned about blocking access to legitimate users who want to visit this site. People can easily find this web site listed in the major search engines. It is simply a waste of bandwidth to put up with some of the characters visiting this site from certain IP addresses in the world. I hear that the spy and rogue robots along with the massdownloader programs can even "interfere" with other web surfers visiting a web site. How? By overloading a web server and slowing down the browsing experience of others. Who needs them? I have been reading several forums lately. I'm not the only webmaster who is tired of the bandwidth wasters and content thieves crawling web sites. If the people using these non-search engine resource hogs (rogue, spy robots and massdownloader programs) on the web would start paying for their crawls and file downloads, webmasters would have an incentive to stop complaining about them.
*This web page was last updated on February 13, 2006.
Easylisteninganne.com
©2006 Anne Allen