A website can be likened to a book: it has a cover, a table of contents and numerous pages. While some books, and even some websites, stay the same over time, most websites are frequently updated with new content.
Following the same analogy, search engines like Google, Bing and DuckDuckGo are the librarians that identify, catalogue, sort and organise all the books so that people can find information quickly.
While crawl spiders follow algorithms to work through your site’s content efficiently, there are still things that can affect how well your site is crawled and indexed.
Below are some best practices that will help improve your site’s crawlability and indexing.
1. Allow Spiders to crawl your site.
Review your robots.txt, .htaccess and sitemaps. It is critical that all your important content is crawlable, so check your robots.txt to see whether you are blocking any important pages. At the same time, make sure you disallow crawling of unimportant pages: search engines do not need to access and crawl pages like login screens or 404 pages.
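As a minimal sketch (the paths below are hypothetical; adjust them to your own site’s structure), a robots.txt that keeps crawlers out of unimportant areas might look like this:

    User-agent: *
    Disallow: /login/
    Disallow: /cart/

    Sitemap: https://www.example.com/sitemap.xml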
You can also stop Google from indexing a page with this meta tag: <meta name="googlebot" content="noindex">. Alternatively, you can return a noindex directive in the HTTP response header, for example "X-Robots-Tag: noindex", to de-index a page.
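If your site runs on Apache with mod_headers enabled (an assumption; the file pattern is only an example), you can also send that header from .htaccess:

    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex"
    </FilesMatch>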
Lastly, keep your XML sitemap up to date and resubmit it whenever you make major changes to your site. You can submit it in Google Search Console under Crawl > Sitemaps.
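For reference, a minimal XML sitemap (the URL and date are placeholders) looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2024-01-01</lastmod>
      </url>
    </urlset>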
2. Keep it Simple with HTML.
While Google’s spiders have grown in sophistication, pages written in plain HTML are still the easiest to crawl and index. If you have a large site with thousands of pages, it can be to your benefit to lessen the load on your servers by avoiding heavy JavaScript, Flash or XML pages. Use small HTML files whenever possible to optimise both load times and crawlability.
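As a rough illustration (the page and its content are made up), a crawl-friendly page keeps its text in plain markup so spiders can read it without executing any scripts:

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>Blue Widgets - Example Store</title>
        <meta name="description" content="Hand-made blue widgets, shipped worldwide.">
      </head>
      <body>
        <h1>Blue Widgets</h1>
        <p>All of the product copy sits in the HTML itself, visible to crawlers on the first request.</p>
      </body>
    </html>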
3. Are you creating redirection loops? Audit Redirects.
Review your site for chains of 301 or 302 redirects. This type of linking is inefficient for search engines to follow and, at worst, can create redirect loops. Limit redirects to no more than two in a row so you do not trap crawlers in loops.
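For example (the paths are hypothetical, and this assumes an Apache .htaccess file), instead of letting redirects chain like this:

    Redirect 301 /old-page /interim-page
    Redirect 301 /interim-page /new-page

point every old URL straight at the final destination:

    Redirect 301 /old-page /new-page
    Redirect 301 /interim-page /new-page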
4. Review HTTP Errors and Fix Them.
Look for HTTP errors as well as duplicate page errors on your site. Make sure you spend time fixing these issues to keep your site free of crawl errors.
Make sure you use rel="canonical" to tell bots which version of a page is the main one. This matters if you serve different versions of your site, such as separate mobile URLs: in that case, the usual setup is to point rel="canonical" on the mobile pages back to the desktop version, and to add a rel="alternate" tag on the desktop pages pointing to the mobile version.
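A sketch of that setup, with example.com and m.example.com standing in for your own hostnames:

    On the desktop page (https://www.example.com/page):
    <link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.example.com/page">

    On the mobile page (https://m.example.com/page):
    <link rel="canonical" href="https://www.example.com/page">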
5. Do you have dynamic content? Review URL Parameters.
If your chosen CMS generates a lot of dynamic URLs, it can hinder search engines from crawling all your content and may also create duplicate content errors. To tell Googlebot how your CMS uses URL parameters, go to Google Search Console under Crawl > URL Parameters. This helps ensure that the extra pages generated by those parameters do not waste crawl budget or get treated as duplicates.
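For example (the URLs are made up), these parameterised addresses may all serve the same listing, so you want crawlers to treat the first one as the main version:

    https://www.example.com/shoes
    https://www.example.com/shoes?sort=price
    https://www.example.com/shoes?sort=price&page=2

Each variant can also carry a canonical tag pointing to the clean URL:

    <link rel="canonical" href="https://www.example.com/shoes">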
6. Do you have content in multiple languages? Use hreflang tags.
If you have content in more than one language, make sure you use hreflang tags to mark those pages up correctly. This ensures that your local-language content is found by search engines and that it does not create duplicate content errors. Even if your site uses a single language, you should still set an hreflang tag declaring that language.
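As a minimal sketch (the domain and language versions are placeholders), English and French versions of a page would reference each other like this in the <head> of both pages:

    <link rel="alternate" hreflang="en" href="https://www.example.com/en/page/">
    <link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page/">
    <link rel="alternate" hreflang="x-default" href="https://www.example.com/">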
Find out more about how to set up multilingual tags here: https://support.google.com/webmasters/answer/189077?hl=en
If you have any questions on this topic or on search engine optimisation in general, comment on this article or drop us a message here.