Log files are of great importance for every SEO audit of a site. They are essential to understanding how the site is discovered and mapped by the various search engine crawlers.
In particular, an analysis of crawl activity can provide valuable insights into the functioning of the site and its individual pages. The behavior of the crawlers changes regularly, so there is no point in searching for small changes every day. Log files are particularly useful when tracking trends over the longer term. For example, they can show whether crawler activity on the site follows a rising or falling trend and which pages on the site are crawled most frequently.
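Tracking such a long-term trend can be sketched with a few lines of Python. This is a minimal illustration that assumes logs in the common Apache/NGINX "combined" format; the sample lines and the focus on Googlebot are my own assumptions for the example.

```python
from collections import Counter
from datetime import datetime
import re

# Assumption: logs are in the Apache/NGINX "combined" format, with the
# timestamp in square brackets. The sample lines below are invented.
DATE = re.compile(r'\[(\d{2}/\w{3}/\d{4})')

def daily_googlebot_hits(lines):
    """Return {date: number of Googlebot requests}, sorted chronologically,
    so a rising or falling trend in crawl activity becomes visible."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = DATE.search(line)
        if m:
            day = datetime.strptime(m.group(1), "%d/%b/%Y").date()
            counts[day] += 1
    return dict(sorted(counts.items()))

sample = [
    '66.249.66.1 - - [09/Oct/2023:08:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2023:09:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2023:10:00:00 +0000] "GET /b HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(daily_googlebot_hits(sample))
```

Plotted over weeks or months, such daily counts show whether crawler attention for the site is growing or shrinking.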
Log files are extremely extensive and record virtually all activity, so they can contain a lot of useful information for an SEO scan. It is therefore important to get hold of those files. If the site is managed in-house, that will not be a problem, but even if the website is hosted by a third party (ISP), the files are usually accessible. Inquire about their availability and also ask how long the log files are kept. Often the provider has log files that go far back in time, so the analysis can be carried out over a longer period, which improves its reliability.
An obvious way to analyze log files is with Microsoft Excel. This requires some dexterity when importing, filtering and sorting the data. An imported log file yields a worksheet with many rows and columns. By applying filters it becomes clear which web crawlers access the site and with what regularity. Search engine crawling follows certain rules. One of the most important is the so-called ‘crawl budget’: the number of pages per site that the crawler works through each day. The crawl budget is not a fixed number, but depends on several factors. Account is taken of the negative effect that crawling may have on the speed of the host. That is why a prioritized list of pages is drawn up, based among other things on PageRank. The crawl priority of sections and pages can also be influenced using the XML sitemap.
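The Excel filtering step described above can also be sketched in Python. This is a hedged example, again assuming "combined"-format log lines; the regular expression, the list of crawler names and the sample data are illustrative assumptions, not part of the original article.

```python
import re
from collections import Counter

# Assumption: Apache/NGINX "combined" log format. The regex below splits a
# line into IP, timestamp, request, status and user agent fields.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# A few well-known crawler user-agent substrings (not an exhaustive list).
CRAWLERS = ("Googlebot", "bingbot", "YandexBot", "DuckDuckBot")

def crawler_hits(lines):
    """Count requests per known crawler, mirroring an Excel filter on the
    user-agent column. Regular visitors are ignored."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        agent = m.group("agent")
        for bot in CRAWLERS:
            if bot in agent:
                counts[bot] += 1
    return counts

sample = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '40.77.167.1 - - [10/Oct/2023:13:56:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:10 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(crawler_hits(sample))
```

Note that user-agent strings can be spoofed; for a serious audit, crawler identity should also be verified against the IP ranges the search engines publish.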
The pages on the crawl list are re-crawled regularly, starting with the pages that have the highest priority. The crawl process is stopped when the crawler activity causes delays on the host. As a result, the most important pages are re-crawled most frequently, and lower-scoring pages may be skipped regularly. That is why it is important to know how the crawlers spend their time on the site. It is possible that the search engines do not crawl enough pages of the website due to a limited ‘crawl budget’. An analysis of the log files can then be extremely useful. It shows which sections and pages are being crawled and how frequently, but also how the crawler spends its time on the site and on which parts it loses time unnecessarily. For example, irrelevant or duplicate URLs may be crawled, such as URLs with ‘URL parameters’ that are appended to multiple pages belonging to the same marketing campaign. To exclude these URLs from the crawl process, you can log in to Google Search Console, select ‘Crawl’ and then ‘URL parameters’. There you can configure URL parameters and exclude them from crawling, leaving more time for the important pages.
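Seeing where the crawl budget actually goes can be sketched as follows. This hedged example assumes that (URL, user agent) pairs have already been extracted from the log lines; the section boundaries (top-level directory) and the sample data are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def crawl_budget_by_section(hits):
    """Given (url, user_agent) pairs extracted from the logs, count Googlebot
    requests per top-level section to see where crawl budget is spent.
    Query strings (e.g. campaign parameters) are stripped, so duplicate
    parameterized URLs collapse onto the same page."""
    counts = Counter()
    for url, agent in hits:
        if "Googlebot" not in agent:
            continue
        path = urlsplit(url).path
        section = "/" + path.strip("/").split("/")[0] if path.strip("/") else "/"
        counts[section] += 1
    return counts

# Invented sample data: one URL carries a campaign parameter.
hits = [
    ("/products/shoes?utm_source=mail", "Googlebot/2.1"),
    ("/products/shirts", "Googlebot/2.1"),
    ("/blog/post-1", "Googlebot/2.1"),
    ("/blog/post-1", "Mozilla/5.0"),
]
print(crawl_budget_by_section(hits))
```

If a low-value section (a filter page, a campaign-parameter cluster) dominates such a tally, that is exactly the kind of wasted crawl time the article describes.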
Another important source of information in the log files is the ‘response codes’ of the web server. Valuable crawl time is lost on requests for non-existent pages. When the log data is sorted on the response code column in Excel, it becomes apparent how often a certain response code is generated by crawler activity.
In principle, the only truly good response code is ‘200’, the message that the page has been found. Redirect codes (300, 301, 302) that ultimately end in a code 200 are also acceptable, although they cost the crawler extra time. Error codes such as 404 and 500 indicate other problems with the site.
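Tallying response codes for crawler requests can be sketched like this. As before, this is a minimal illustration assuming "combined"-format log lines and invented sample data; it gives the same view you would get by sorting the status column in Excel.

```python
import re
from collections import Counter

# Assumption: the status code is the three-digit number that follows the
# quoted request field in a "combined"-format log line.
STATUS = re.compile(r'" (\d{3}) ')

def status_counts(lines, bot="Googlebot"):
    """Count response codes for requests made by the given crawler.
    404s and 5xx responses deserve a closer look."""
    counts = Counter()
    for line in lines:
        if bot not in line:
            continue
        m = STATUS.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Invented sample: one missing page, one OK page, one redirect.
sample = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /old-page HTTP/1.1" 404 0 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2023:13:55:40 +0000] "GET /home HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2023:13:55:44 +0000] "GET /moved HTTP/1.1" 301 0 "-" "Googlebot/2.1"',
]
print(status_counts(sample))
```

A high share of 404s or 301s in this tally means the crawler is spending budget on dead ends and detours rather than on live pages.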
Are there tools on the market to facilitate this whole process? Hell yes. My preference is Screaming Frog, a freemium spider tool. The free version has a limit of 500 URLs per crawl, which is usually enough for small sites.
Do not forget the log files …
Log files can contain a wealth of data that can be crucial for the website’s search performance. Using log files, in addition to web-based analytics, helps to determine the ‘condition’ of a site and can help to solve the problems the site experiences.