These programs frequently roam the World Wide Web, searching for new or updated web pages, which helps search engines index, or “catalog”, every website correctly and produce better results for their users. Some websites are “crawled” daily, while others are visited far less often.
When a search engine spider arrives at your website, it first looks for a robots.txt file in the site’s root directory. This is a plain text file (not HTML) that tells crawlers which areas of your site are off limits and shouldn’t be cataloged. Examples include pages that are a waste of a crawler’s time, such as Flash pages. A robots.txt file does not redirect anything; it asks crawlers to stay away from these pages, and well-behaved crawlers comply. Search engine crawlers also collect the outbound links on each page, and these links will eventually be followed to other pages. Spiders follow links from one page to another, but the frequency of visits varies from one search engine to another, because each engine maintains its own, different database.
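As a rough illustration of how a crawler reads these rules, the sketch below uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `/flash/` and `/private/` paths, and the `www.example.com` address are made up for the example, and `build_parser` is just a helper name chosen here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /flash/
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A page outside the disallowed areas may be crawled.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))

# A page under /flash/ is off limits to every crawler that honors the file.
print(parser.can_fetch("Googlebot", "http://www.example.com/flash/intro.html"))
```

The file only works on crawlers that choose to respect it, so it should not be treated as a way to protect sensitive pages.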
Every website owner should know which pages the search engine robots have visited. You can find this out by looking at your server log reports or checking the results from your log statistics program. If you don’t have one, you should upgrade your web hosting service. VectorInter.Net provides these tools for free with every website hosting contract. Luckily, most search engine spiders are easily identifiable by their “user agent” names. Google’s robot is named “Googlebot”. Many other crawlers also have funny names, such as Inktomi’s robot, “Slurp”.
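If you would rather check the raw logs yourself, a short script can count crawler visits by user agent name. The sketch below assumes your server writes the widely used “combined” access log format, where the user agent is the last quoted field on each line. The sample log lines, the `count_crawler_hits` function, and the list of crawler names are invented for illustration, so adjust them to match your own logs.

```python
import re
from collections import Counter

# Crawler names to look for; extend this tuple with any others you care about.
KNOWN_CRAWLERS = ("Googlebot", "Slurp")

# Assumption: in the "combined" log format the user agent is the
# last double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(log_lines):
    """Return a Counter of visits per known crawler name."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue  # Line does not end with a quoted user agent field.
        user_agent = match.group(1).lower()
        for name in KNOWN_CRAWLERS:
            if name.lower() in user_agent:
                hits[name] += 1
                break
    return hits


# Made-up sample lines; real logs would be read from your server's log file.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [10/Oct/2023:14:02:11 +0000] "GET /about.html HTTP/1.1" 200 1187 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.12 - - [10/Oct/2023:14:05:42 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.10 - - [10/Oct/2023:14:09:03 +0000] "GET /contact.html HTTP/1.1" 200 980 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

hits = count_crawler_hits(SAMPLE_LOG)
for name, count in hits.items():
    print(name, count)
```

Keep in mind that a user agent name is only what the visitor claims to be, so a match here is a good hint rather than proof that the visit came from the real search engine.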