Someone on Reddit posted a question about a “crawl budget” issue, asking whether a large number of 301 redirects pointing to 410 responses was causing Googlebot to exhaust their crawl budget. Google’s John Mueller offered a likely reason for the lackluster crawl pattern and clarified a point about crawl budgets in general.
Crawl Budget
It’s commonly accepted that Google has a crawl budget, a concept that SEOs invented to explain why some sites aren’t crawled enough. The idea is that every site is allotted a set number of crawls, a cap on how much crawling it qualifies for.
It’s important to understand the background of the crawl budget concept because it helps explain what it really is. Google has long insisted that there is no single thing at Google that can be called a crawl budget, although the way Google crawls a site can give the impression that there is a cap on crawling.
Matt Cutts, a top Google engineer at the time, alluded to this in a 2010 interview.
Matt answered a question about a Google crawl budget by first explaining that there was no crawl budget in the way that SEOs conceive of it:
“The first thing is that there isn’t really such thing as an indexation cap. A lot of people were thinking that a domain would only get a certain number of pages indexed, and that’s not really the way that it works.
There is also not a hard limit on our crawl.”
In 2017 Google published a crawl budget explainer that pulled together numerous crawling-related facts which, taken together, resemble what the SEO community had been calling a crawl budget. The explanation is more precise than the vague catch-all phrase “crawl budget” ever was (Google’s crawl budget document is summarized here by Search Engine Journal).
In short, the main points about crawl budget are:
- The crawl rate is the number of URLs Google can crawl, based on the server’s ability to supply the requested URLs.
- A shared server, for example, can host tens of thousands of websites, resulting in hundreds of thousands if not millions of URLs, so Google has to pace its crawling according to each server’s ability to respond to requests for pages.
- Pages that are essentially duplicates of others (like faceted navigation) and other low-value pages can waste server resources, limiting the number of pages that a server can serve to Googlebot for crawling.
- Lightweight pages are easier to crawl, so Google can crawl more of them.
- Soft 404 pages (pages that return a 200 status but display a “not found” message) can cause Google to focus on those low-value pages instead of the pages that matter; a quick way to spot them is sketched after this list.
- Inbound and internal link patterns can help influence which pages get crawled.
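Some of these factors can be spot-checked from the outside. Below is a minimal Python sketch, assuming the third-party requests library and a hypothetical list of URLs to test, that reports the status code and response size for each URL and flags 200 responses whose body contains “not found” wording, a common sign of a soft 404:

```python
# Rough diagnostic sketch, not part of Google's documentation.
# Assumes the third-party `requests` library; the URL list and the
# "not found" keyword heuristic below are hypothetical placeholders.
import requests

URLS = [
    "https://example.com/",
    "https://example.com/some-page/",
]

SOFT_404_HINTS = ("page not found", "no longer available")  # placeholder markers

for url in URLS:
    response = requests.get(url, timeout=10)
    body = response.text.lower()
    # A 200 status combined with "not found" wording is a likely soft 404.
    looks_like_soft_404 = response.status_code == 200 and any(
        hint in body for hint in SOFT_404_HINTS
    )
    note = " (possible soft 404)" if looks_like_soft_404 else ""
    print(f"{url} -> {response.status_code}, {len(response.content)} bytes{note}")
```

A script like this is only a rough signal; Search Console’s crawl stats and indexing reports remain the authoritative view of how Googlebot actually treats those URLs.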
Reddit Question About Crawl Rate
The person on Reddit wanted to know if the perceived low-value pages they were creating were influencing Google’s crawl budget. In short, a request for the non-secure (HTTP) URL of a page that no longer exists redirects to the secure (HTTPS) version of the missing webpage, which serves a 410 response (a status code meaning the page is permanently gone).
It’s a legitimate question.
This is what they asked:
“I’m trying to make Googlebot forget to crawl some very-old non-HTTPS URLs, that are still being crawled after 6 years. And I placed a 410 response, in the HTTPS side, in such very-old URLs.
So Googlebot is finding a 301 redirect (from HTTP to HTTPS), and then a 410.
http://example.com/old-url.php?id=xxxx -301-> https://example.com/old-url.php?id=xxxx (410 response)
Two questions. Is G**** happy with this 301+410?
I’m suffering ‘crawl budget’ issues, and I do not know if this two responses are exhausting Googlebot
Is the 410 effective? I mean, should I return the 410 directly, without a first 301?”
Google’s John Mueller answered:
“G*?
301’s are fine, a 301/410 mix is fine.
Crawl budget is really just a problem for massive sites ( https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget ). If you’re seeing issues there, and your site isn’t actually massive, then probably Google just doesn’t see much value in crawling more. That’s not a technical issue.”
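The 301-then-410 chain described in the question is easy to reproduce from the outside. Below is a minimal Python sketch, using the requests library and the placeholder URL from the Reddit post, that follows the redirect and prints each hop, which is roughly what Googlebot encounters:

```python
# Sketch for tracing the 301 -> 410 chain described in the Reddit post.
# The URL below is the placeholder from the post; substitute a real URL to test.
import requests

url = "http://example.com/old-url.php?id=xxxx"
response = requests.get(url, allow_redirects=True, timeout=10)

# response.history holds each intermediate redirect response, in order.
for hop in response.history:
    print(f"{hop.url} -> {hop.status_code} (Location: {hop.headers.get('Location')})")

# The final response should be the HTTPS URL answering 410 Gone.
print(f"{response.url} -> {response.status_code}")
```

As Mueller says, the 301/410 mix is fine; serving the 410 directly on the HTTP URL would only save one request per URL, a difference that matters mainly on massive sites.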
Reasons For Not Getting Crawled Enough
Mueller responded that “probably” Google isn’t seeing the value in crawling more webpages. That suggests the webpages could use a review to identify why Google might determine that those pages aren’t worth crawling.
Certain popular SEO tactics tend to create low-value webpages that lack originality. For example, a common practice is to review the top-ranked webpages to understand what factors explain why those pages rank, then use that information to improve one’s own pages by replicating what’s already working in the search results.
That sounds logical, but it doesn’t create anything of value. If you think of it as a binary choice between one and zero, where zero represents what’s already in the search results and one represents something original and different, the popular tactic of emulating what’s already ranking is doomed to produce another zero: a website that offers nothing more than what’s already in the SERPs.
Clearly there are technical issues, such as server health, that can affect the crawl rate.
But in terms of what is understood as a crawl budget, Google has long maintained that it’s a consideration for massive sites, not for small to medium-sized websites.
Read the Reddit discussion:
Is G**** happy with 301+410 responses for the same URL?
Featured Image by Shutterstock/ViDI Studio