Since I joined City A.M., we’ve been gradually migrating away from Drupal 7 to a fully bespoke CMS built on Laravel. We had planned to do this anyway, but we accelerated parts of the migration as we hit issues that we were unable to solve. Here’s a summary of the main ones.
Issues with paginated archives
We originally had paginated archive pages, which we didn’t consider replacing until quite recently – at the time we needed to keep the functionality in place. We noticed that crawlers were spidering every page in those archives, one by one – some 3,000+ pages for some of our categories – and as most pages beyond the first few were uncached, this caused serious load issues on the website whenever we were crawled.
We considered caching the archive pages for longer, but simple paginated archives can’t be cached in this way. If you add one story to the website, everything else gets pushed down one place in the archive – which means every page changes. If you have 25 stories per archive page, as soon as you add 25 new stories, each page is completely different – and you’ll have one new page on the end.
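To make that concrete, here’s a tiny illustration – not code from our CMS – of why offset-based pagination invalidates every cached page at once:
<?php
// Offset-based pagination: page N shows stories (N-1)*25 to N*25-1, newest first.
function pageFor(int $position, int $perPage = 25): int
{
    return intdiv($position, $perPage) + 1; // position 0 is the newest story
}

echo pageFor(24); // 1 – this story sits at the bottom of page 1
// Publish one new story and every existing story moves down one position:
echo pageFor(25); // 2 – the same story is now at the top of page 2
// So every previously cached archive page is stale, not just the first one.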
Caching these pages for longer doesn’t solve the problem; it just delays it. Varnish was really saving us during traffic peaks – when I started, an uncached article page took over 4 seconds to load (with both the Varnish cache and the Drupal cache empty for that page). With our Laravel setup, an uncached page loads in under 2 seconds, with much lower server activity per page load. Caching is a must for any high-traffic site – but we found it was even more important in Drupal.
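For a rough idea of what this looks like on the Laravel side – a sketch, not our exact configuration, and the route, model and TTL below are invented for the example – an archive route can send a Cache-Control header so Varnish holds the page for a short window:
<?php
// routes/web.php – illustrative only; the 5-minute TTL is an example value.
use Illuminate\Support\Facades\Route;

Route::get('/news/archive', function () {
    $articles = \App\Models\Article::latest()->paginate(25);

    return response()
        ->view('archive', ['articles' => $articles])
        // s-maxage lets a shared cache such as Varnish serve the page for a
        // while, which smooths traffic peaks – but, as above, a short TTL
        // only delays the invalidation problem rather than solving it.
        ->header('Cache-Control', 'public, s-maxage=300');
});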
To be fair, pagination going to thousands of pages isn’t a problem limited to Drupal, and most news sites disable it anyway.
HTTP 200 status returned for bogus URLs
Firstly, consider a URL such as this:
http://www.cityam.com/news
That’s a valid URL, and returns HTTP 200. But what about this?
http://www.cityam.com/news/asdfasdfasdf
It’s an invalid page. But because part of the URL matches a valid route, Drupal 7 returns HTTP 200. Worse still, the canonical URL for the page matches the (bogus) URL you’re actually on, rather than the real URL.
We had various weird URLs in Bing Webmaster Tools, including this:
http://www.cityam.com/news/2013/03/26/london-is-still-the-strongest-financial-centre-in-the-world
Because it starts with /news, Drupal 7 returned HTTP 200 for it. The URL persisted in Bing despite our attempts to remove it – in fact, we found it started to get indexed with a slight variation:
http://www.cityam.com/news/2013/03/26/london-is-still-the-strongest-financial-centre-in-the-world?page=10
And that got indexed with thousands of different page numbers.
A bug report suggests this issue is fixed in Drupal 8 – but it’s incredible to read the comments there. Either people aren’t fully understanding the impact, or it’s a real pain to fix – but back in 2014, we couldn’t hang around to find out.
Having said that, it looks like the Independent found a way to solve it… this URL correctly returns a 404, for instance:
http://www.independent.co.uk/life-style/gadgets-and-tech/news/aaaa
Either way, this issue caused us a massive problem: crawlers saw HTTP 200 for pages that didn’t exist, and because those pages were rarely cached, crawling them generated a lot of server load.
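One way to avoid the canonical half of the problem is to build the tag from the content record rather than from the request URL. A minimal Blade sketch of the idea – the variable and field names are assumptions, not our actual templates:
{{-- resources/views/article.blade.php (illustrative) --}}
{{-- The canonical URL is built from the article's stored path, so even if the
     page were somehow reachable at a mangled URL, the canonical tag would
     still point at the real one rather than echoing the request. --}}
<link rel="canonical" href="{{ url($article->path) }}">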
Non-existent paginated pages return an HTTP 200 status
The Acronis blog (powered by Drupal 7) has a paginated archive running back 40 or so pages. Here’s a weird quirk, though: the page numbers in the URL are zero-indexed, so the URL says ?page=39 while page 40 is highlighted, page 1 has no page number, and page 2 has ?page=1 in the URL. It’s fairly consistent, but a bit confusing.
So let’s go to page 41 – or page 40 in the URL. Now, what happens if you go to page 999? It loads the page with no content. How about if we load that page with curl?
$ curl -I http://www.acronis.com/en-gb/blog/posts?page=999
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 10 Sep 2016 12:15:47 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Keep-Alive: timeout=20
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
X-Content-Type-Options: nosniff
Content-Language: en
X-Frame-Options: SAMEORIGIN
X-Generator: Drupal 7 (http://drupal.org)
Link: <http://www.acronis.com/en-gb/blog/>; rel="canonical",<http://www.acronis.com/en-gb/blog/>; rel="shortlink"
X-Frame-Options: SAMEORIGIN
Crawlers could therefore spider endless pages of empty content and see these as valid pages.
Massive sitemaps
The Drupal sitemap module generates a Google XML sitemap containing a maximum of 50,000 URLs – Google’s limit for sitemaps. When that threshold is reached, it creates a second sitemap.
Although it technically works, this became a maintenance nightmare for us. The sitemap files were huge and took time to regenerate, and any change on the site – a new page, a removed page, and so on – meant the entire sitemap had to be regenerated. We didn’t literally regenerate it the moment something changed, but the fact that any change left the sitemap out of date caused us many headaches.
You can see how Drupal sitemaps work by looking at the Independent sitemap – although for some reason, it doesn’t work correctly – clicking on any of the paginated archive links simply takes you to the same page. Whoops.
How our bespoke CMS handles these issues
We decided to do away with paginated archives anyway, or at least scale them back a bit. For example, here’s the City A.M. Technology archive for September 2016. Date pickers allow you to drill down and find older content if you need to. There’s no pagination running to thousands of pages. However, it seems that we do need to serve a 404 for non-existent pages.
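A sketch of what that guard might look like in Laravel – the route and model below are assumptions, not our actual code – is simply to 404 whenever a requested archive slice has no content:
<?php
// routes/web.php – illustrative only.
use Illuminate\Support\Facades\Route;

Route::get('/technology/{year}/{month}', function ($year, $month) {
    $articles = \App\Models\Article::whereYear('published_at', $year)
        ->whereMonth('published_at', $month)
        ->latest('published_at')
        ->get();

    // An empty month (or a date in the future) isn't a real page – return a
    // 404 rather than an HTTP 200 with no content for crawlers to index.
    abort_if($articles->isEmpty(), 404);

    return view('archive', ['articles' => $articles]);
});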
Bogus URLs return 404s. Not much more to say on that one.
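That said, the general shape of it is simple enough to sketch – hedged, because our real routing is more involved than this: a catch-all article route that resolves strictly against stored paths and 404s anything else.
<?php
// routes/web.php – illustrative only; registered after all other routes.
use Illuminate\Support\Facades\Route;

Route::get('/{path}', function (string $path) {
    // firstOrFail() throws a ModelNotFoundException, which Laravel converts
    // into an HTTP 404 – so /news/asdfasdfasdf can never return 200.
    $article = \App\Models\Article::where('path', $path)->firstOrFail();

    return view('article', ['article' => $article]);
})->where('path', '.*');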
As for sitemaps, here’s the City A.M. XML sitemap. We have a few special sitemaps at the top, then a list of year/month sitemaps for our content. Although the list is fairly long, this is much more manageable. The file size of each sitemap is much smaller than the 50,000 URL sitemaps created by Drupal, which means it takes less time for our scripts to read and write to these files – we believe it also allows Google to read the files a bit faster. Most changes happen in the sitemap for the current month, so we refresh that sitemap frequently. If a change happens in a previous month, we update that sitemap. We can regenerate individual sitemaps via a screen in our CMS should we need to do so, and we can also regenerate the whole lot if required.
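To give a flavour of how lightweight this can be, here’s a stripped-down sketch of writing a sitemap index that points at one sitemap per month – not our production generator, and the filename scheme is invented for the example:
<?php
// Illustrative sitemap index generator – not our production code.
$months = ['2016-07', '2016-08', '2016-09']; // in reality, derived from content

$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true);
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('sitemapindex');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($months as $month) {
    $xml->startElement('sitemap');
    $xml->writeElement('loc', "http://www.cityam.com/sitemap-{$month}.xml");
    $xml->endElement();
}

$xml->endElement();
file_put_contents('sitemap.xml', $xml->outputMemory());

// Only the current month's file changes often, so publishing a story only
// means rewriting one small sitemap rather than a 50,000-URL monster.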
For a large site with lots of content, we’ve found that our bespoke CMS has helped to solve SEO issues, improve site speed, and reduce server load.