
10th September 2016 by Ben Barden

SEO and technical issues we encountered in Drupal 7

Since I joined City A.M., we’ve been gradually migrating away from Drupal 7 to a fully bespoke CMS built on Laravel. We had planned to do this anyway, but we accelerated parts of the migration as we hit issues that we were unable to solve. Here’s a summary of the main ones.

Issues with paginated archives

We originally had paginated archive pages, which we didn't consider replacing until quite recently; at the time, we needed to keep the functionality in place. What we noticed was that crawlers were spidering every page in the archive, one by one – 3,000+ pages for some of our categories. As most of the pages beyond the first few were uncached, this caused serious load issues on the website whenever we were crawled.

We considered caching the archive pages for longer, but simple paginated archives can’t be cached in this way. If you add one story to the website, everything else gets pushed down one place in the archive – which means every page changes. If you have 25 stories per archive page, as soon as you add 25 new stories, each page is completely different – and you’ll have one new page on the end.

Caching these pages for longer doesn't solve the problem; it just delays it. Varnish was really saving us during traffic peaks – when I started, an uncached article page took over 4 seconds to load (with both the Varnish cache and the Drupal cache for that page empty). With our Laravel setup, an uncached page drops to under 2 seconds, with much lower server activity per page load. Caching is a must for any high-traffic site – but we found it was even more important in Drupal.

To be fair, pagination going to thousands of pages isn’t a problem limited to Drupal, and most news sites disable it anyway.

HTTP 200 status returned for bogus URLs

Firstly, consider a URL such as this:

http://www.cityam.com/news

That’s a valid URL, and returns HTTP 200. But what about this?

http://www.cityam.com/news/asdfasdfasdf

It’s an invalid page. But because part of the URL matches a valid route, Drupal 7 returns HTTP 200. Worse still, the canonical URL for the page matches the actual (bogus) URL you’re on, rather than the real URL.
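
If you want to check a URL quickly, curl can print just the status code (the same curl -I approach appears further down this post). While we were still on Drupal 7, the second command here would have printed 200 just like the first; under our new CMS it correctly returns a 404:

$ curl -s -o /dev/null -w "%{http_code}\n" http://www.cityam.com/news
$ curl -s -o /dev/null -w "%{http_code}\n" http://www.cityam.com/news/asdfasdfasdf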

We had various weird URLs in Bing Webmaster Tools, including this:

http://www.cityam.com/news/2013/03/26/london-is-still-the-strongest-financial-centre-in-the-world

Because it starts with /news, Drupal 7 returned HTTP 200 for it. This URL persisted in Bing despite our attempts to remove it – in fact, we found that it started to get indexed with a slight variation:

http://www.cityam.com/news/2013/03/26/london-is-still-the-strongest-financial-centre-in-the-world?page=10

And that got indexed with thousands of different page numbers.

A bug report suggests this issue is fixed in Drupal 8 – but it’s incredible to read the comments there. Either people don’t fully understand the impact, or it’s a real pain to fix – but back in 2014, we couldn’t hang around to see if or when this would be resolved.

Having said that, it looks like the Independent found a way to solve it… this URL correctly returns a 404, for instance:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/aaaa

Either way, this issue caused us a massive problem: crawlers saw HTTP 200 for pages that did not exist, and because those pages were often not cached, they generated high load.

Non-existent archive pages also return an HTTP 200 status

The Acronis blog (powered by Drupal 7) has an archive that goes back 40 or so pages. Here’s a weird quirk, though: on the last page, the URL says page 39, but page 40 is highlighted. That’s because page 1 has no page number and page 2 has ?page=1 in the URL. It’s consistent, but a bit confusing.

So let’s go to page 41 – or ?page=40 in the URL. Now, what happens if you go to page 999? It loads a page with no content. How about if we request that page with curl?

$ curl -I http://www.acronis.com/en-gb/blog/posts?page=999
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 10 Sep 2016 12:15:47 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Keep-Alive: timeout=20
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
X-Content-Type-Options: nosniff
Content-Language: en
X-Frame-Options: SAMEORIGIN
X-Generator: Drupal 7 (http://drupal.org)
Link: <http://www.acronis.com/en-gb/blog/>; rel="canonical",<http://www.acronis.com/en-gb/blog/>; rel="shortlink"
X-Frame-Options: SAMEORIGIN

Crawlers could therefore spider endless pages of empty content and see these as valid pages.

Massive sitemaps

The Drupal sitemap module generates a Google XML sitemap containing a maximum of 50,000 URLs – Google’s limit for sitemaps. When that threshold is reached, it creates a second sitemap.

Although it technically works, this became a maintenance nightmare for us. The sitemap files were huge and took time to regenerate, and we were unnecessarily regenerating the entire sitemap when anything changed – new pages, removed pages, and so on. We didn’t literally regenerate sitemaps the moment something changed, but the fact that any change on the site made the sitemap inaccurate caused us many headaches.

You can see how Drupal sitemaps work by looking at the Independent sitemap – although for some reason it doesn’t work correctly: clicking on any of the paginated archive links simply takes you to the same page. Whoops.


How our bespoke CMS handles these issues

We decided to do away with paginated archives anyway, or at least scale them back a bit. For example, here’s the City A.M. Technology archive for September 2016. Date pickers allow you to drill down and find older content if you need to. There’s no pagination running to thousands of pages. However, it seems that we do still need to serve a 404 for archive pages that don’t exist.
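
If we wanted to close that gap, a minimal sketch in Laravel might look like the following. This isn’t our production code, and the route, model, and view names are made up – but it shows the idea: if the requested page number runs past the last page of results, serve a 404 instead of an empty page.

// routes/web.php – illustrative only; route, model, and view names are hypothetical
Route::get('/technology/{year}/{month}', function ($year, $month) {
    $articles = App\Article::whereYear('published_at', $year)
        ->whereMonth('published_at', $month)
        ->orderBy('published_at', 'desc')
        ->paginate(25);

    // A ?page= number beyond the last page should be a hard 404,
    // not an empty page served with HTTP 200.
    if ((int) request('page', 1) > $articles->lastPage()) {
        abort(404);
    }

    return view('archive.month', ['articles' => $articles]);
});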

Bogus URLs return 404s. Not much more to say on that one.
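
For anyone curious how that might look, here’s a rough sketch of the approach in Laravel – again, not our exact code, and the model and view names are hypothetical: look the slug up and fail hard if nothing matches.

// routes/web.php – illustrative only
Route::get('/news/{slug}', function ($slug) {
    // firstOrFail() throws a ModelNotFoundException when no article matches,
    // which Laravel converts into an HTTP 404 response – so a URL like
    // /news/asdfasdfasdf gets a proper 404 rather than a soft 200.
    $article = App\Article::where('slug', $slug)->firstOrFail();

    return view('article.show', ['article' => $article]);
});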

As for sitemaps, here’s the City A.M. XML sitemap. We have a few special sitemaps at the top, then a list of year/month sitemaps for our content. Although the list is fairly long, this is much more manageable. The file size of each sitemap is much smaller than the 50,000 URL sitemaps created by Drupal, which means it takes less time for our scripts to read and write to these files – we believe it also allows Google to read the files a bit faster. Most changes happen in the sitemap for the current month, so we refresh that sitemap frequently. If a change happens in a previous month, we update that sitemap. We can regenerate individual sitemaps via a screen in our CMS should we need to do so, and we can also regenerate the whole lot if required.
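
For anyone who hasn’t built one, a sitemap index is just a small XML file pointing at the individual sitemaps – roughly like this (the filenames below are illustrative, not our real paths):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- special sitemaps first -->
  <sitemap><loc>http://www.cityam.com/sitemap-news.xml</loc></sitemap>
  <!-- then one small sitemap per year/month of content -->
  <sitemap><loc>http://www.cityam.com/sitemap-2016-09.xml</loc></sitemap>
  <sitemap><loc>http://www.cityam.com/sitemap-2016-08.xml</loc></sitemap>
</sitemapindex>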

For a large site with lots of content, we’ve found that our bespoke CMS has helped to solve SEO issues, improve site speed, and reduce server load.

Filed Under: Publishing, Tech. Tagged With: bespoke cms, caching, cms, drupal, google xml sitemaps, laravel, pagination, seo
