Every search engine follows three basic steps to generate results from web pages: crawling, indexing, and ranking (serving). These steps are strictly sequential: ranking cannot take place without indexing, and indexing cannot take place without crawling.
Crawling is the process in which programs known as 'web crawlers' access web pages that are publicly available. Indexing is the process in which a search engine reviews the content of every page and stores the pieces of information it finds. Ranking is the process in which, after a user types a query, a search engine presents the most relevant results from its index, ordered by relevance.
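The three steps can be sketched as a minimal pipeline. This is a toy model under illustrative assumptions: the URLs and page text are invented, "crawling" reads from an in-memory dictionary instead of fetching over HTTP, and ranking is a simple word-overlap count rather than a real relevance algorithm.

```python
from collections import defaultdict

# Hypothetical corpus standing in for fetched pages (invented URLs and text).
PAGES = {
    "https://example.com/": "search engines crawl the public web",
    "https://example.com/indexing": "indexing stores page content for fast lookup",
    "https://example.com/ranking": "ranking orders indexed pages by relevance",
}

def crawl(pages):
    """Crawling: visit every publicly available page and fetch its content."""
    return list(pages.items())

def build_index(documents):
    """Indexing: build an inverted index mapping each word to the URLs containing it."""
    inverted = defaultdict(set)
    for url, text in documents:
        for word in text.lower().split():
            inverted[word].add(url)
    return inverted

def rank(inverted, query):
    """Ranking: score pages by how many query words they contain, best first."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in inverted.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))

results = rank(build_index(crawl(PAGES)), "indexing")
```

Note how the pipeline enforces the ordering described above: `rank` needs the inverted index, which in turn needs the crawled documents.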
A website's internal linking structure is crucial, since crawlers use the links on a website to find other web pages. Crawlers give appropriate priority to new websites, changes made to existing websites, and dead links. Which websites are crawled, how frequently, and how many pages the search engine fetches from each are all determined by an automated process. Hosting capabilities, including bandwidth and server resources, also influence crawling.
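Link-based discovery can be illustrated with a breadth-first crawl over a tiny in-memory site. The URLs and HTML here are invented for the sketch; a real crawler would fetch pages over the network and respect robots rules. The point is that a page with no inbound internal links is never discovered.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory site (URL -> HTML), standing in for live fetches.
SITE = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/blog/post-1">Post</a>',
    "/blog/post-1": '<a href="/">Home</a>',
    "/orphan": 'No page links here, so the crawler never finds it.',
}

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(site, start="/"):
    """Breadth-first crawl: follow internal links from already-known pages."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        for link in parser.links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

found = crawl(SITE)
```

Running the crawl from `/` discovers every linked page but never reaches `/orphan`, which is why internal linking matters for discoverability.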
This is where the crawl budget comes into play. Crawl budget denotes the number of web pages that a crawler is prepared to crawl in a given period of time. The search engine establishes the crawl budget for each website automatically, taking key factors into consideration: the size of the website (bigger sites need a larger crawl budget), the server setup (page load time and site performance), the frequency of updates (regularly updated content is prioritized), and links (dead links and the internal linking structure).
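How these factors might combine can be sketched as a toy estimator. The function name, inputs, weights, and formula below are all illustrative assumptions for this article, not a real search engine's budgeting logic; the sketch only shows that each factor pushes the budget in the direction described above.

```python
def estimate_crawl_budget(pages, avg_response_ms, updates_per_week, broken_link_ratio):
    """Toy crawl-budget estimate (pages per day) from the four factors above.
    All weights are illustrative assumptions, not a real search-engine formula."""
    size_factor = min(pages / 1000, 10)                   # bigger sites earn more budget
    speed_factor = max(0.2, 500 / avg_response_ms)        # slow servers get crawled less
    freshness_factor = 1 + min(updates_per_week / 7, 2)   # frequent updates are prioritized
    link_penalty = 1 - min(broken_link_ratio, 0.5)        # dead links waste budget
    return int(100 * size_factor * speed_factor * freshness_factor * link_penalty)
```

Under this model, halving the average response time roughly doubles the budget, while a high ratio of broken links shrinks it.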
Search engines like Google have mechanisms that ensure their crawlers visit a website no more often than the website can handle. Crawl rate limit is one such mechanism, which enables Google to establish the crawl budget for a website. Crawl demand is another, which takes into account how much demand there is for a specific URL in the index, so as to determine how actively it should be recrawled. A URL's popularity and its staleness are the two factors that determine crawl demand.
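A crawl-demand score driven by those two factors might be sketched as follows. The exponential staleness decay and the weighting are illustrative assumptions for this article, not Google's actual model; the sketch only captures that both a more popular URL and a longer-uncrawled URL score higher.

```python
import math

DAY = 86400  # seconds in a day

def crawl_demand(popularity, last_crawled_ts, now_ts, half_life_days=7.0):
    """Toy crawl-demand score: rises with URL popularity and with staleness
    (time since the last crawl). The decay model is an illustrative assumption."""
    days_stale = (now_ts - last_crawled_ts) / DAY
    staleness = 1 - math.exp(-days_stale / half_life_days)  # approaches 1 over time
    return popularity * (0.5 + staleness)
```

A popular URL that has not been crawled for two weeks thus scores higher than the same URL crawled yesterday, so the scheduler revisits it sooner.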