Topic: Bots that slow your server down, anything you can do? (Read 1324 times)
Supreme Overlord
Posts: 149
910 credits Members referred : 0
www.centos.org
« on: Feb 26, 2006, 01:05:46 PM »
I know getting your server indexed by a search engine is a good thing. But is there anything you can do when you have a ton of bots on your different websites and they start slowing the server down?
I am a metal monkey!
Administrator Community Supporter?
Jedai Sword Master
Posts: 8362
43159 credits Members referred : 3
« Reply #1 on: Feb 26, 2006, 01:09:22 PM »
I am not sure you can really do anything about this, except maybe use some kind of caching.
Global Moderator
Internet Junkie
Posts: 1523
6847 credits Members referred : 8
Gimme all your cookies!!!
« Reply #2 on: Feb 26, 2006, 03:38:52 PM »
Maybe try editing your robots.txt file to limit the bots to certain areas. There is also a meta tag in the head of your HTML that instructs bots to only visit again after a certain period. I think, though, that the first one would work better than the latter...
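For anyone wanting to try the robots.txt route, here is a minimal sketch of how a set of rules could be checked before uploading the file. It uses Python's standard urllib.robotparser module. The disallowed paths and the bot name ExampleBot are made-up placeholders, not anything from this thread, and only well-behaved bots obey robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: these paths are placeholders for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/search/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a rule-abiding bot may fetch the given path."""
    return parser.can_fetch(agent, path)


print(allowed("/index.php"))        # True
print(allowed("/images/logo.png"))  # False
```

Keeping bots out of heavy areas such as search pages or image folders is usually the cheapest first step, since it costs nothing on the server side.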
I wish I was an Oscar winner
Posts: 90
560 credits Members referred : 0
« Reply #5 on: Mar 01, 2006, 04:08:19 AM »
Try robots.txt, and if that doesn't work, just ban their IPs.
I love Pokemon
Posts: 14
84 credits Members referred : 0
« Reply #6 on: Mar 01, 2006, 09:17:06 AM »
It very much depends on what you call a 'bot'. Many are simply image or script harvesters which serve no useful purpose at all in terms of ranking, PR, etc. The worst bot I know of for using up server resources is the Inktomi/Slurp bot used by Yahoo. Things can be so bad with this bot that it has its own directive you can use in robots.txt, called Crawl-delay.
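As a rough illustration of the Crawl-delay directive, the sketch below parses a small robots.txt with Python's standard urllib.robotparser and reads the delay back. The value 60 is only an example, and OtherBot is a made-up name. Crawl-delay is a non-standard extension, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt asking Slurp to wait 60 seconds between requests.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("Slurp"))     # 60
print(parser.crawl_delay("OtherBot"))  # None, no rule applies to this bot
```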
I am a metal monkey!
Administrator Community Supporter?
Jedai Sword Master
« Reply #7 on: Mar 01, 2006, 09:55:13 AM »
Thanks for sharing that, Guardian. I was aware of this directive.
I love Pokemon
« Reply #8 on: Mar 01, 2006, 10:34:48 AM »
Mine is set to 60, as I regularly get between 100 and 200 of them constantly throughout the day. Now, here is a scary thought... Many servers are configured incorrectly and allow the traversal of non-existent directories. For example, try www.mysite.com/index.php/index.php on a site; it will work on many sites. Yahoo will actually try to spider non-existent URLs to check that its bot is working. In the cases where these server configurations allow traversal of non-existent URLs, the header response to the bot will be a 200 (instead of 404), so the non-existent URL will actually get indexed!
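One way to test a server for this is to request a URL that should not exist and look at the status code that comes back. The following is only a sketch using Python's standard urllib. The function names and the User-Agent string are hypothetical, and urlopen follows redirects, so the code reported is the one from the last response.

```python
import urllib.error
import urllib.request


def final_status(url, timeout=10):
    """Return the HTTP status a client ends up with for url."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "status-check-sketch"}  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the code is still what a bot receives.
        return error.code


def is_indexable_ghost(url):
    """True when a URL that should not exist answers 200 instead of 404."""
    return final_status(url) == 200
```

Calling is_indexable_ghost() on something like the index.php/index.php URL above would return True on a server with the misconfiguration described, and False on one that answers 404 correctly.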
Global Moderator
Internet Junkie
Gimme all your cookies!!!
« Reply #9 on: Mar 01, 2006, 04:06:39 PM »
That could be a problem, especially if you have mod_rewrite pages that are not checked, because in some cases you may end up with any number of pages that have the exact same content, and they may think that you are spamming!
I love Pokemon
« Reply #10 on: Mar 01, 2006, 06:08:55 PM »
Exactly. There is no way to know which URL Slurp or other bots may choose to test a server for the correct header response, so you cannot exactly 'anticipate' what it is and set up a Redirect or RedirectMatch.
As the server actually produces a page, it has content and can therefore be indexed even though the URL does not actually exist. The index.php/index.php case is a classic example of a non-existent URL that I have seen indexed. Google search gives some excellent examples.
I am a metal monkey!
Administrator Community Supporter?
Jedai Sword Master
« Reply #11 on: Mar 01, 2006, 06:18:22 PM »
Quote
Google search gives some excellent examples.
650,000 indexed pages. That's a lot, I guess.
I have seen that even with pages that return 404 error codes, Slurp visits them again and again. I made an error in a link about two months ago, and even now Slurp still visits that non-existent page.
Maybe this is a bug in the Slurp crawler or something.
Global Moderator
Internet Junkie
Gimme all your cookies!!!
« Reply #12 on: Mar 01, 2006, 06:24:56 PM »
I have added a permanent redirect for my pages that are incorrectly indexed or have moved, yet they have still been visiting them for months... don't know why they are wasting their time?
Quote
I have added a permanent redirect for my pages that are incorrectly indexed or have moved, yet they have still been visiting them for months... don't know why they are wasting their time?
If you are redirecting, then the bot(s) end up on a valid page, since the final header response is 200. If you see a bot crawling an invalid URL when examining your server logs etc., you should redirect to a 404 error page so the bot receives the correct header response and will eventually drop the URL from the indexed cache. Of course, sometimes it is preferable to redirect to a valid page, but eventually you end up in a situation where a bot may think you are spamming due to duplicate content. This is a nightmare, lol.
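To check what a crawler actually receives for an old URL, it can help to look at the first response before any redirect is followed. A permanent redirect normally answers 301 with a Location header, the target page then answers 200, and a removed page should answer 404. The sketch below uses Python's standard http.client, which does not follow redirects on its own. The function name and User-Agent string are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def first_hop(url, timeout=10):
    """Return (status, Location header) for url without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": "first-hop-sketch"})
        response = conn.getresponse()
        # Location is None unless the server sent a redirect target.
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

If first_hop() shows 302 or 200 for a page that was meant to be permanently moved, the redirect is probably not set up as permanent, which might explain bots coming back for months.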
Cyberpunk Wannabe
Posts: 38
256 credits Members referred : 0
« Reply #14 on: Mar 01, 2006, 09:01:05 PM »
I don't think that having duplicate content is a really big problem for the search engines.
If you have duplicate pages, they just don't get indexed.
MODs: I think this post should be in the SEO category.
I am a metal monkey!
Administrator Community Supporter?
Jedai Sword Master
« Reply #15 on: Mar 01, 2006, 09:02:51 PM »
Topic moved. Thanks.
As for the duplicate content, I think that you may be right, but I can't be sure about this at the moment.
Quote
I have added a permanent redirect for my pages that are incorrectly indexed or have moved, yet they have still been visiting them for months... don't know why they are wasting their time?
If you are redirecting, then the bot(s) end up on a valid page, since the final header response is 200. If you see a bot crawling an invalid URL when examining your server logs etc., you should redirect to a 404 error page so the bot receives the correct header response and will eventually drop the URL from the indexed cache. Of course, sometimes it is preferable to redirect to a valid page, but eventually you end up in a situation where a bot may think you are spamming due to duplicate content. This is a nightmare, lol.
I use the following to show Slurp that the redirect is permanent: